Abstract: We propose a new statistical model for computational linguistics.
Rather than trying to estimate directly the probability distribution of
a random sentence of the language, we define a Markov chain on finite sets 
of sentences with many finite recurrent communicating classes and define 
our language model as the invariant probability measures of the chain on 
each recurrent communicating class. This Markov chain, that we call a com- 
munication model, recombines at each step randomly the set of sentences 
forming its current state, using some grammar rules. When the grammar 
rules are fixed and known in advance instead of being estimated on the 
fly, we can prove supplementary mathematical properties. In particular, we 
can prove in this case that all states are recurrent states, so that the chain 
defines a partition of its state space into finite recurrent communicating 
classes. We show that our approach is a decisive departure from Markov 
models at the sentence level and discuss its relationships with Context Free 
Grammars. Although the toric grammars we use are closely related to Con- 
text Free Grammars, the way we generate the language from the grammar 
is qualitatively different. Our communication model has two purposes. On 
the one hand, it is used to define indirectly the probability distribution of 
a random sentence of the language. On the other hand it can serve as a 
(crude) model of language transmission from one speaker to another speaker 
through the communication of a (large) set of sentences.

1. Introduction to a new communication model

In the well known kernel approach to density estimation on a measurable space
$\mathcal{X}$, the probability distribution $P$ of a random variable
$X \in \mathcal{X}$ is estimated from a sample $(X_1, \dots, X_n)$ of $n$
independent copies of $X$ as $\frac{1}{n} \sum_{i=1}^{n} k(X_i, \mathrm{d}x)$,
where $k$ is a suitable Markov kernel. This kernel estimate can be seen as a
modification of the empirical measure
$\bar{P} = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$.

In the context of natural language modeling at the sentence level, $\mathcal{X}$
is the set of sentences, that is the set of sequences of words of finite length.

Finding sensible kernel estimates or sensible parametric models in this context 
is a challenge. Therefore, we propose here another route, that we will describe 
as an alternative way of producing a modification of the empirical measure. The 
idea is to recombine repeatedly a set of sentences. Let us describe for this a 
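general framework, concerned with an arbitrary countable state space $\mathcal{X}$.

Let $\mathcal{P}_n = \bigl\{ \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i},\ x_i \in \mathcal{X} \bigr\}$
be the set of empirical measures of all possible samples of size $n$. Let us
consider a parametric family $\{ q_\theta, \theta \in \Theta \}$ of Markov
kernels on $\mathcal{P}_n$. Let us assume for simplicity that for any
$P \in \mathcal{P}_n$, the reachable set
$\{ Q \in \mathcal{P}_n,\ \sum_{t \in \mathbb{N}} q_\theta^t(P, Q) > 0 \}$ is
finite, where $q_\theta^t$ is $q_\theta$ composed $t$ times with itself, so that
for instance
$q_\theta^2(P, Q) = \sum_{P' \in \mathcal{P}_n} q_\theta(P, P')\, q_\theta(P', Q)$.
In this case we can define the Markov kernel
$$
\hat{q}_\theta(P, Q) = \lim_{k \to \infty} \frac{1}{k} \sum_{t=1}^{k} q_\theta^t(P, Q).
$$
It is such that for any $P \in \mathcal{P}_n$, $\hat{q}_\theta(P, \cdot)$ is an
invariant measure of $q_\theta$. More generally
$\hat{q}_\theta q_\theta = q_\theta \hat{q}_\theta = \hat{q}_\theta$. The
distribution $\hat{q}_\theta(P, \cdot)$ on $\mathcal{P}_n$ induces a marginal
distribution $\hat{Q}_{\theta, P}$ on $\mathcal{X}$ through the formula
$$
\hat{Q}_{\theta, P} = \sum_{Q \in \mathcal{P}_n} \hat{q}_\theta(P, Q)\, Q. \tag{1.1}
$$
In this paper, we will be concerned with estimators of the form
$\hat{P} = \hat{Q}_{\theta, \bar{P}}$, if $\theta$ is fixed in advance, or of
the form $\hat{Q}_{\hat{\theta}(\bar{P}), \bar{P}}$, if $\hat{\theta}$ is an
estimator of the parameter $\theta$ depending also on the empirical measure
$\bar{P}$.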
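
As an illustration of the averaged kernel defined above, here is a minimal sketch on a hypothetical three-state chain (the states, the kernel `q` and the helper names are assumptions made for the example, not part of the model). It uses a finite `k` in place of the limit; in this toy case every even `k` already gives the limiting value. The chain has a periodic recurrent class, so the powers of the kernel do not converge, while their averages do.

```python
from fractions import Fraction

def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cesaro_kernel(q, k):
    """Return (1/k) * (q + q^2 + ... + q^k), the averaged kernel up to time k."""
    n = len(q)
    power = [row[:] for row in q]    # current power q^t, starting with t = 1
    total = [row[:] for row in q]    # running sum of the powers
    for _ in range(k - 1):
        power = mat_mul(power, q)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return [[x / k for x in row] for row in total]

# Hypothetical toy kernel on three states: {0, 1} is a periodic recurrent
# class, {2} is an absorbing state (a second recurrent class).
q = [[Fraction(0), Fraction(1), Fraction(0)],
     [Fraction(1), Fraction(0), Fraction(0)],
     [Fraction(0), Fraction(0), Fraction(1)]]

q_hat = cesaro_kernel(q, 1000)
print(q_hat[0])   # averaged distribution when starting from state 0
```

Each row of `q_hat` is invariant under `q`, and rows started in different recurrent classes differ, which is the mechanism by which one kernel can carry several language models.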

Another interpretation of our framework is to consider $q_\theta$ as a
communication model. One speaker hears a set of sentences described by its
empirical distribution $P \in \mathcal{P}_n$ (which means that he will not make
use of the order in which he has heard them). He uses those sentences to learn
the corresponding language. Then he teaches another speaker what he has learnt
by outputting another random set of sentences, distributed according to
$q_\theta(P, \cdot)$. The language model (as opposed to the communication model
$q_\theta$) is $\hat{Q}_{\theta, P}$, the average sentence distribution along an
infinite chain of communicating speakers. If we start from a recurrent state
$P$, and we assume that $\theta$ is known, we obtain a communication model where
the target sentence distribution $\hat{Q}_{\theta, P}$ can be learnt without
error from the set of sentences output by any involved speaker. Indeed
$\hat{Q}_{\theta, P} = \hat{Q}_{\theta, Q}$ for any $Q$ in the communicating
class of $P$, which in this situation is also the reachable set from $P$.

This error free estimation behaviour is desirable for a communication model. 
It tells us that the language can be transmitted from speaker to speaker without 
distortion, a desirable feature in the case of a large number of speakers. The 
model may also account for weak stimulus learning, the fact that human beings 
learn language through a limited number of examples compared with the variety 
of new sentences they are able to formulate. Indeed, whereas the size of the
support of $P \in \mathcal{P}_n$, the number of sentences heard by one speaker,
is constant and equal to $n$, the support of the language model
$\hat{Q}_{\theta, P}$ may be much larger. We will actually give a toy example
where the number of sentences in the language is exponential in $n$.

In the language transmission interpretation, we may evaluate the interest of
the model by studying whether it can model a large family of sentence
distributions. This richness will depend on the number of recurrent communicating
classes of the communication Markov model $q_\theta$, since any invariant
distribution $\hat{q}_\theta(P, \cdot)$ is a convex combination of the unique
invariant measures supported by each recurrent communicating class. The
situation is even simpler in the case when all $P \in \mathcal{P}_n$ are
recurrent states (a fact we will be able to prove in our particular model). In
this case $\hat{q}_\theta(P, \cdot)$ is the unique invariant measure supported
by the recurrent communicating class to which $P$ belongs.

In this paper we will focus on the construction and mathematical properties
of the communication model $q_\theta$. We will also touch on the estimation
problem stated in the opening of this introduction by providing some estimator
$\hat{Q}_{\hat{\theta}(\bar{P}), \bar{P}}$, where $\hat{\theta}(\bar{P})$ is an
estimator of the parameter computed on the observed sample. However we will
leave the mathematical properties of this estimator for further studies. We
will be content with providing some promising preliminary experiments computed
on a small sample and will share with the reader some qualitative explanations
of its behaviour.

The parameter $\theta$ of our model will be a new kind of grammar, closely related
to Context Free Grammars, but used to generate sentences in a different way. 

2. Toric grammars 

Now that we have explained our general framework based on a communication 
Markov kernel $q_\theta$ defined on empirical distributions, let us come to natural
language modeling more specifically, and describe a dedicated family of kernels. 

Natural language processing in linguistics has been using more and more 
elaborate mathematical tools (a brief presentation of some of them is given by 
E. Stabler in [14]). The n-gram models are widely used, although they fail to 
grasp the recursive nature of natural languages, and do not use the syntactic 
properties of sentences. Efforts have been made to improve the performance of 
these models by introducing syntax (Della Pietra et al. 1994 [9]; Roark 2001
[12]; Tan et al. 2012 [15]). One way to do this is to use Context Free Grammars, 
also named phrase structure grammars, introduced by N. Chomsky as possible 
models for the logical structure of natural languages (see for example [4-6]), 
and their probabilistic variants (Chi 1999 [3]). Our proposal follows this trend, 
but with the goal to separate ourselves from classic n-grams, seeing syntax as 
equivalence classes between constituents, which we try to discover. 

We consider some dictionary of words D. Each statistical sample, as explained 
in the introduction, is made of a set of sentences. Each sentence is a sequence 
of words of D. The sentences may be of variable length. 

To simplify notations we will use non normalized empirical measures. Thus,
the state space of the communication Markov kernel $q_\theta$ will be
$$
\mathcal{P}_n = \Bigl\{ \sum_{i=1}^{n} \delta_{x_i},\ x_i \in D^+ \Bigr\},
$$
where we introduce the notation
$$
D^+ = \bigcup_{j=1}^{\infty} D^j.
$$
We will call $\mathcal{P}_n$ the set of texts of length $n$. Let us notice that
for us, texts are unordered sets of sentences. The question of generating
meaningful ordered sequences of sentences is also of interest, but will not be
addressed in this study.

In order to define the communication kernel, we will describe random trans- 
formations on texts, related to the notion of Context Free Grammars. Let us 
start with an informal presentation. The communication kernel will perform 
random recombinations of sentences. 

Our point of view is to see a Context Free Grammar as the result of some 
fragmentation process applied to a set of sentences. Let us explain this on a 
simple example. Consider the sentence 

This is my friend Peter. 

Imagine we would like to represent this sentence as the result of pasting the 
expression my friend in its context, because we think language is built by cutting 
and pasting expressions drawn from some large set of memorized sentences. We 
can do this by introducing the simple Context Free Grammar 

$\boxed{0} \to$ This is $\boxed{1}$ Peter .
$\boxed{1} \to$ my friend

where we have used numbered framed boxes for non terminal symbols, the start
symbol being $\boxed{0}$. The two rules mean that we can rewrite the start symbol
to obtain the right-hand side of the first rule, and that we can then rewrite the
non terminal symbol $\boxed{1}$ as the right-hand side of the second rule.

Since we want to see the rules of the grammar as the result of some splitting
operation, we are going to use more symmetric notations. Instead of considering
that we have described our original sentence with the help of two rules and two
non terminal symbols $\boxed{0}$ and $\boxed{1}$, we may as well consider that we
have split our original sentence into two new sentences using three non terminal
symbols, namely $\boxed{0} \to$, $\boxed{1}$ and $\boxed{1} \to$. To emphasize
this interpretation, we can write these three non terminal symbols as $[_0$,
$]_1$ and $[_1$. With these new notations, the representation of our original
sentence is now

$[_0$ This is $]_1$ Peter .
$[_1$ my friend

In this new representation, the rewriting rules can be replaced by merge
operations of the type
$$
a \,]_i\, c + [_i\, b \mapsto a\, b\, c.
$$
We can make these merge operations even more symmetric, if we consider that
each expression can be represented by any of its circular permutations. Indeed,
each expression contains exactly one non terminal symbol of the form $[_i$ and
therefore is uniquely defined by any of its circular permutations (since, due to
this feature, we can define the permutation in which the opening bracket $[_i$
comes first as the canonical form, and recover it from any other circular
permutation). Using this convention, we can write $a \,]_i\, c$ as
$c\, a \,]_i$ and describe the merge operation as
$$
c\, a \,]_i + [_i\, b \mapsto c\, a\, b,
$$
or, renaming $c\, a$ as $a$, simply as
$$
a \,]_i + [_i\, b \mapsto a\, b.
$$

Let us formalize what we have explained so far. Let $D$ be some dictionary of
words (which can be, for the sake of this mathematical description, any finite
set, representing the words of the natural language to be modeled). Let us form
the symbol set $S = D \cup \{ [_i, ]_i,\ i \in \mathbb{N} \}$. Let us define the
set of circular permutations of a sequence of symbols as
$$
\mathfrak{S}(w_0, \dots, w_{\ell-1}) = \bigl\{ (w_{(i+j) \bmod \ell},\ i = 0, \dots, \ell-1),\ j = 0, \dots, \ell-1 \bigr\},
$$
so that for instance
$\mathfrak{S}(w_0, w_1, w_2) = \{ w_0 w_1 w_2,\ w_1 w_2 w_0,\ w_2 w_0 w_1 \}$,
and its support (the set of symbols included in the sequence) as
$$
\operatorname{supp}(w_0, \dots, w_{\ell-1}) = \{ w_0, \dots, w_{\ell-1} \}.
$$
Let $A^+ = \bigcup_{n=1}^{\infty} A^n$,
$]_+ = \{ ]_i,\ i \in \mathbb{N} \setminus \{0\} \}$, and consider the set of
expressions
$$
\mathcal{E} = \bigl\{ e \in \mathfrak{S}([_i\, a),\ i \in \mathbb{N},\ a \in (D \cup ]_+)^+ \setminus ]_+ \bigr\}.
$$
In plain words, an expression is a circular permutation of a finite sequence of
symbols starting with an opening bracket, containing no other opening bracket
and not reduced to an opening bracket followed by a closing bracket.
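
The plain-words description of expressions can be sketched in code. The encoding below is an assumption made for illustration only: words are strings, an opening bracket with label i is the pair `("[", i)` and a closing bracket is `("]", i)`; the function names are hypothetical.

```python
def is_open(sym):
    """True for an opening bracket, encoded as the pair ("[", i)."""
    return isinstance(sym, tuple) and sym[0] == "["

def is_close(sym):
    """True for a closing bracket, encoded as the pair ("]", i)."""
    return isinstance(sym, tuple) and sym[0] == "]"

def rotations(seq):
    """All circular permutations of a sequence, as a set of tuples."""
    seq = tuple(seq)
    return {seq[j:] + seq[:j] for j in range(len(seq))}

def canonical(seq):
    """Rotate seq so that its unique opening bracket comes first."""
    seq = tuple(seq)
    j = next(k for k, s in enumerate(seq) if is_open(s))
    return seq[j:] + seq[:j]

def is_expression(seq):
    """Check the definition: one opening bracket, followed (up to rotation)
    by a non-empty run of words and closing brackets with positive labels
    that is not a single closing bracket."""
    seq = tuple(seq)
    if sum(1 for s in seq if is_open(s)) != 1:
        return False
    rest = canonical(seq)[1:]
    if any(is_close(s) and s[1] == 0 for s in rest):
        return False
    if len(rest) == 0:
        return False
    return not (len(rest) == 1 and is_close(rest[0]))

e = (("[", 0), "This", "is", ("]", 1), "Peter", ".")
r = ("Peter", ".", ("[", 0), "This", "is", ("]", 1))   # another rotation of e
print(canonical(r) == e, len(rotations(e)))
```

Because an expression contains exactly one opening bracket, all its circular permutations are distinct, so storing only the canonical form loses no information.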

This definition mirrors the fact that a given rule of a Context Free Grammar
has exactly one $\boxed{i} \to$ (the left side), and the right side of the rule
cannot be just a non terminal symbol $\boxed{j}$. Indeed, if we had allowed
$\boxed{i} \to \boxed{j}$, or with our notations $[_i\, ]_j$, we could as well
have replaced $i$ by $j$ everywhere.

Definition 2.1

The set of toric grammars is the set $\mathfrak{G}$ of positive measures
$\mathcal{G}$ on $\mathcal{E}$ with finite support such that for any circular
permutation $e' \in \mathfrak{S}(e)$ of any expression $e \in \mathcal{E}$,
$\mathcal{G}(e') = \mathcal{G}(e)$.

In other words, a toric grammar $\mathcal{G}$ is a positive measure with finite
support on the set of expressions $\mathcal{E}$ satisfying
$$
\mathcal{G}(e) = |\mathfrak{S}(e)|^{-1}\, \mathcal{G}\bigl(\mathfrak{S}(e)\bigr), \qquad e \in \mathcal{E}.
$$

Let us remark that, in our definition of toric grammars, on top of choosing
some special notations for Context Free Grammars, we also introduced positive 
weights, so that it is more the support of a toric grammar than the grammar 
itself that corresponds to the usual notion of Context Free Grammar. 

The weights will serve to keep track of word frequencies through the process 
of splitting a set of sentences to obtain a toric grammar. 

Our aim is indeed to build a toric grammar from a text. To be consistent 
with our definition of grammars, we will also define texts as positive measures. 
Let us give a formal definition. We will forget the sentence order: a text will be
an unordered set of sentences with possible repetitions. 

Definition 2.2

The set $\mathfrak{T}$ of texts is the set of toric grammars with integer
weights supported by $\mathfrak{S}([_0 D^+)$, that is the set of toric grammars
with integer weights using only one non terminal symbol, the start symbol $[_0$.

In this definition, it should be understood that
$$
[_0 D^+ = \bigl\{ ([_0, w_1, \dots, w_k),\ \text{where } k \in \mathbb{N} \setminus \{0\} \text{ and } w_i \in D,\ 1 \le i \le k \bigr\},
$$
and that
$$
\mathfrak{S}([_0 D^+) = \bigcup_{e \in [_0 D^+} \mathfrak{S}(e).
$$

3. A roadmap towards a communication model 

We will use toric grammars as intermediate steps to define the transition prob- 
abilities of our communication model on texts. To this purpose, we will first 
introduce some general types of transformations on toric grammars (reminding 
the reader that in our formalism texts are some special subset of toric gram- 
mars). 

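It will turn out that two types of expressions, global expressions and local
expressions, will play different roles. Let us define them respectively as
$$
\mathcal{E}_g = \mathcal{E} \cap \mathfrak{S}([_0 S^+), \qquad
\mathcal{E}_\ell = \mathcal{E} \cap \mathfrak{S}([_+ S^+),
$$
where we recall that $[_+ = \{ [_i,\ i \in \mathbb{N} \setminus \{0\} \}$ and
$S^+ = \bigcup_{j=1}^{\infty} S^j$. Any toric grammar
$\mathcal{G} \in \mathfrak{G}$ can be accordingly decomposed into
$\mathcal{G} = \mathcal{G}_g + \mathcal{G}_\ell$, where
$\mathcal{G}_g(A) = \mathcal{G}(A \cap \mathcal{E}_g)$ and
$\mathcal{G}_\ell(A) = \mathcal{G}(A \cap \mathcal{E}_\ell)$, for any subset
$A \subset \mathcal{E}$.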

The transitions of the communication chain with kernel
$q_{\mathcal{R}}(\mathcal{T}, \mathcal{T}')$ will be defined in two steps. The
first step consists in learning from the text $\mathcal{T}$ a toric grammar
$\mathcal{G}$. To this purpose we will split the sentences of $\mathcal{T}$ into
syntactic constituents. The second step consists in merging the constituents
again to produce a random new text $\mathcal{T}'$. The parameter
$\theta = \mathcal{R}$ of the communication kernel $q_\theta$ will also be a
toric grammar. The role of this reference grammar $\mathcal{R}$ will be to
provide a stock of local expressions to be used when computing $\mathcal{G}$
from $\mathcal{T}$. We will discuss later the question of the estimation of
$\mathcal{R}$ itself. For the time being, we will assume that the reference
grammar $\mathcal{R}$ is a parameter of the communication chain, known to all
involved speakers.

We could have defined a communication kernel
$q_{\hat{\mathcal{R}}(\mathcal{T})}(\mathcal{T}, \mathcal{T}')$, where the
reference grammar $\hat{\mathcal{R}}(\mathcal{T})$ itself is estimated at each
step from the current text $\mathcal{T}$, but we would have obtained a model
with weaker properties, where, in particular, not all states are necessarily
recurrent. On the other hand, the proof that the reachable set from any starting
point is finite still holds for this modified model, so that it does provide an
alternative way of defining a language model as described in the introduction.

We will still need an estimator $\hat{\mathcal{R}}(\mathcal{T})$ of the
reference grammar, in order to provide a language estimator
$\hat{Q}_{\hat{\mathcal{R}}(\mathcal{T}), \mathcal{T}}$, where we are using the
notations of eq. (1.1) on page 2. The estimation of the reference grammar
$\hat{\mathcal{R}}(\mathcal{T})$ will be achieved by running some fragmentation
process on the text $\mathcal{T}$.

4. Non stochastic syntax splitting and merging 

Let us now describe the model, starting with the description of some non random 
grammar transformations. We already introduced a model for grammars that 
includes texts as a special case. We have now to describe how to generate a toric 
grammar from a text, with, or without, the help of a reference grammar to learn 
the local component of the grammar. The mechanism producing a grammar from 
a text will be some sort of random parse algorithm (or rather tentative parse 
algorithm) . 

All of this will be achieved by two transformations on toric grammars that 
will respectively split and merge expressions (syntagms) of a toric grammar into 
smaller or bigger ones. We will first describe the sets of possible splits and merges 
from a given grammar. This will serve as a basis to define random transitions 
from one grammar to another in subsequent sections. 

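Let us first introduce some elementary operations involving toric grammars.
$$
\begin{aligned}
e \oplus f &= \sum_{s \in \mathfrak{S}(e)} \delta_s + \sum_{s \in \mathfrak{S}(f)} \delta_s, & e, f &\in \mathcal{E}, \\
e \ominus f &= \sum_{s \in \mathfrak{S}(e)} \delta_s - \sum_{s \in \mathfrak{S}(f)} \delta_s, & e, f &\in \mathcal{E}, \\
\rho \otimes e &= \rho \sum_{s \in \mathfrak{S}(e)} \delta_s, & \rho &\in \mathbb{R}_+,\ e \in \mathcal{E}.
\end{aligned}
$$
The first operation builds a toric grammar containing expressions $e$ and $f$
with weights 1, and the third one builds a toric grammar containing expression
$e$ with weight $\rho$.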

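We can generalize these notations to be able to take the sum of a toric
grammar and an expression, as well as the sum of two toric grammars.
$$
\begin{aligned}
\mathcal{G} \oplus e &= \mathcal{G} + \sum_{s \in \mathfrak{S}(e)} \delta_s, & \mathcal{G} &\in \mathfrak{G},\ e \in \mathcal{E}, \\
\mathcal{G} \ominus e &= \mathcal{G} - \sum_{s \in \mathfrak{S}(e)} \delta_s, & \mathcal{G} &\in \mathfrak{G},\ e \in \mathcal{E}, \\
\mathcal{G} \oplus \mathcal{G}' &= \mathcal{G} + \mathcal{G}', & \mathcal{G}, \mathcal{G}' &\in \mathfrak{G}.
\end{aligned}
$$
With these notations, a split is described as
$$
\mathcal{G}' = \mathcal{G} \ominus ab \oplus a]_i \oplus [_i b, \qquad \mathcal{G}, \mathcal{G}' \in \mathfrak{G},
$$
the fact that $\mathcal{G}, \mathcal{G}' \in \mathfrak{G}$ implying that
$i \in \mathbb{N} \setminus \{0\}$, $ab,\ a]_i,\ [_i b \in \mathcal{E}$ and
$\mathcal{G}(ab) \ge 1$.
The (partial) order relation $\mathcal{G} \le \mathcal{G}'$ will also be defined
by the rule
$$
\mathcal{G} \le \mathcal{G}' \iff \mathcal{G}' - \mathcal{G} \in \mathfrak{G},
$$
or equivalently
$$
\mathcal{G} \le \mathcal{G}' \iff \mathcal{G}(e) \le \mathcal{G}'(e) \text{ for all } e \in \mathcal{E}.
$$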
Let us resume our example. Starting from the one sentence text

$\mathcal{T} = 1 \otimes [_0$ This is my friend Peter .

we get after splitting the grammar

$\mathcal{G} = [_0$ This is $]_1$ Peter . $\oplus$ $[_1$ my friend

which can also be written as

$\mathcal{G} =$ Peter . $[_0$ This is $]_1$ $\oplus$ $[_1$ my friend

In this example, as well as in the following, punctuation marks are treated as 
words, so that here the required dictionary has to include { is, my, friend, Peter, 
This, . }. 
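
The split of the example sentence can be reproduced with a small sketch. It rests on assumptions made only for illustration: a grammar is stored as a `Counter` that maps each expression, in canonical form (opening bracket first), to its weight; brackets are the pairs `("[", i)` and `("]", i)`; the function `split` and its arguments are hypothetical names.

```python
from collections import Counter

def split(grammar, expr, start, stop, label):
    """Return grammar (-) ab (+) a]_label (+) [_label b, with b = expr[start:stop].

    `grammar` maps canonical expressions (tuples starting with an opening
    bracket) to integer weights; `expr` is one of its keys.
    """
    if grammar[expr] < 1:
        raise ValueError("the expression to split must have weight >= 1")
    if label < 1:
        raise ValueError("split labels must be positive")
    if not (1 <= start < stop <= len(expr)) or stop - start == len(expr) - 1:
        raise ValueError("b must be a proper, non-empty part of the expression")
    b = expr[start:stop]
    if len(b) == 1 and isinstance(b[0], tuple):
        raise ValueError("[_i b may not be reduced to two brackets")
    outer = expr[:start] + (("]", label),) + expr[stop:]   # a ]_label
    inner = (("[", label),) + b                            # [_label b
    result = Counter(grammar)
    result[expr] -= 1
    if result[expr] == 0:
        del result[expr]
    result[outer] += 1
    result[inner] += 1
    return result

sentence = (("[", 0), "This", "is", "my", "friend", "Peter", ".")
text = Counter({sentence: 1})
grammar = split(text, sentence, 3, 5, label=1)   # cut out "my friend"
for expression, weight in grammar.items():
    print(weight, expression)
```

Storing one canonical representative per expression is a design choice of the sketch; the weight of every circular permutation is understood to be the same.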

Splitting a sentence providing a new label for each split does not create 
generalization, since it allows only to merge back two expressions that came 
from the same split. To create a grammar capable of yielding new sentences, 
we need some label identification scheme. We will perform label identification 
through the more general process of label remapping, identification being a 
consequence of the fact that the map may not be one to one. Let
$$
\mathfrak{F} = \{ f : \mathbb{N} \to \mathbb{N} \text{ such that } f(0) = 0 \}
$$
be the set of label maps. For any symbol $]_i$ or $[_i$, $i \in \mathbb{N}$, let
us define $f(]_i) = ]_{f(i)}$ and $f([_i) = [_{f(i)}$. Let us also define for any
word $w \in D$, $f(w) = w$, and for any expression
$e = (w_0, \dots, w_{\ell-1})$, $f(e) = (f(w_0), \dots, f(w_{\ell-1}))$. Since
any grammar $\mathcal{G} \in \mathfrak{G}$ is a measure on the set of
expressions $\mathcal{E}$, we can define its image measure by $f$, considered as
a map from $\mathcal{E}$ to $\mathcal{E}$. We will put
$f(\mathcal{G}) = \mathcal{G} \circ f^{-1}$, meaning that
$f(\mathcal{G})(A) = \mathcal{G}(f^{-1}(A))$, for any subset
$A \subset \mathcal{E}$.

Definition 4.1

Two label maps $f$ and $g \in \mathfrak{F}$ are said to be isomorphic if there
is a one to one label map $h \in \mathfrak{F}$ such that $g = h \circ f$. In
this case $h^{-1} \in \mathfrak{F}$ and $f = h^{-1} \circ g$.
Two grammars $\mathcal{G}$ and $\mathcal{G}' \in \mathfrak{G}$ are said to be
isomorphic if there is a one to one label map $f \in \mathfrak{F}$ such that
$f(\mathcal{G}) = \mathcal{G}'$. In this case,
$f^{-1}(\mathcal{G}') = \mathcal{G}$ and we will write
$\mathcal{G} \equiv \mathcal{G}'$. If $f$ and $g$ are two isomorphic label maps,
then for any toric grammar $\mathcal{G} \in \mathfrak{G}$, $f(\mathcal{G})$ and
$g(\mathcal{G})$ are isomorphic grammars. In the rest of this paper, to ease
notations and simplify exposition, we will freely identify isomorphic label maps
and isomorphic grammars and often speak of them as if they were equal.
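
Label remapping, and the difference between identification and isomorphism, can be sketched as follows. The representation is an assumption for illustration: a grammar is a `Counter` from canonical expressions to weights, and a label map is a Python dictionary that is the identity wherever it has no entry; `relabel` and `image` are hypothetical names.

```python
from collections import Counter

def relabel(f, sym):
    """Apply a label map to one symbol; words are left unchanged."""
    if isinstance(sym, tuple):
        kind, i = sym
        return (kind, f.get(i, i))
    return sym

def image(f, grammar):
    """Image measure f(G): weights of expressions with the same image add up."""
    if f.get(0, 0) != 0:
        raise ValueError("a label map must satisfy f(0) = 0")
    result = Counter()
    for expr, weight in grammar.items():
        result[tuple(relabel(f, s) for s in expr)] += weight
    return result

# Two sentences split with two distinct labels: no generalization yet.
g = Counter({
    (("[", 0), "This", "is", ("]", 1), "Peter", "."): 1,
    (("[", 1), "my", "friend"): 1,
    (("[", 0), "This", "is", ("]", 2), "John", "."): 1,
    (("[", 2), "my", "neighbour"): 1,
})

identified = image({2: 1}, g)     # not one to one: labels 1 and 2 are merged
swapped = image({1: 2, 2: 1}, g)  # one to one: an isomorphic grammar
print(sorted(identified.values()), image({1: 2, 2: 1}, swapped) == g)
```

After identification both local expressions carry label 1, so either can fill either sentence; the swap, being invertible, changes nothing but names.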

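This being said, we proceed with the introduction of a set of grammar
transformations $\beta$ that consist in a split with possible label remapping.
The split will be the core component for generating a toric grammar from a
text, by splitting the sentences in smaller parts (syntagms).

Definition 4.2 (Splitting rule)

For any $\mathcal{G} \in \mathfrak{G}$, let us consider
$$
\beta(\mathcal{G}) = \bigl\{ f(\mathcal{G}'),\ f \in \mathfrak{F},\ \mathcal{G}' \in \mathfrak{G},\ \mathcal{G}' = \mathcal{G} \ominus ab \oplus a]_i \oplus [_i b \bigr\}.
$$
Let us remark that in this definition, necessarily,
$ab,\ a]_i,\ [_i b \in \mathcal{E}$, $i \in \mathbb{N} \setminus \{0\}$,
$1 \otimes ab \le \mathcal{G}$, and $a]_i \oplus [_i b \le \mathcal{G}'$. Let us
put
$$
\beta^*(\mathcal{G}) = \bigcup_{n=0}^{\infty} \underbrace{\beta \circ \cdots \circ \beta}_{n \text{ times}}(\mathcal{G}),
$$
the set of grammars that can be constructed from repeated invocations of
$\beta$.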
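
Lemma 4.1

Let us recall that $S = D \cup \{ [_i, ]_i,\ i \in \mathbb{N} \}$ and let us put
$S^* = \bigcup_{n=0}^{\infty} S^n$. For any text $\mathcal{T} \in \mathfrak{T}$,
and any $\mathcal{G} \in \beta^*(\mathcal{T})$, $\mathcal{G}$ is a toric grammar
with integer weights,
$$
\begin{aligned}
\mathcal{G}([_i S^*) &= \mathcal{G}(]_i S^*), & i &\in \mathbb{N} \setminus \{0\}, \\
\mathcal{G}(w S^*) &= \mathcal{T}(w S^*), & w &\in D \cup \{[_0\},
\end{aligned}
$$
and in particular
$$
\begin{aligned}
\mathcal{G}([_0 S^*) &= \mathcal{T}([_0 S^*), \\
\mathcal{G}(w S^*) &\le \mathcal{T}(w S^*), & w &\in (D \cup \{[_0\})^+.
\end{aligned}
$$
This means that in any toric grammar obtained by splitting a text, the weights
of expressions containing the two forms $]_i$ and $[_i$ of a label are balanced,
the word frequencies are the same in the grammar and in the text, and the number
of sentences contained in the text is given by the total weight of expressions
containing the start symbol $[_0$ in the grammar.

Proof. For the first assertion, an induction on the number of applications of
$\beta$ yields the result, since
$$
\mathcal{T}([_i S^*) = \mathcal{T}(]_i S^*) = 0, \qquad i \in \mathbb{N} \setminus \{0\},
$$
and, for any $\mathcal{G}' = \mathcal{G} \ominus ab \oplus a]_i \oplus [_i b$,
and any label $j \notin \{0, i\}$,
$$
\begin{aligned}
\mathcal{G}'([_j S^*) &= \mathcal{G}([_j S^*), & (4.1) \\
\mathcal{G}'(]_j S^*) &= \mathcal{G}(]_j S^*), & (4.2)
\end{aligned}
$$
whereas
$$
\begin{aligned}
\mathcal{G}'([_i S^*) &= \mathcal{G}([_i S^*) + 1, & (4.3) \\
\mathcal{G}'(]_i S^*) &= \mathcal{G}(]_i S^*) + 1. & (4.4)
\end{aligned}
$$
For the second assertion, it suffices to remark that the weight of expressions
beginning with a given word is invariant by application of $\beta$. Indeed, any
word symbol $w \in D \cup \{[_0\}$ appears the same number of times at the
beginning of an expression of $1 \otimes ab$ and of $a]_i \oplus [_i b$. □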
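
The invariants of Lemma 4.1 can be checked numerically on a small case. The sketch below assumes, for illustration only, that a grammar is stored as a `Counter` from canonical expressions to weights, with brackets encoded as pairs; `start_weight` is a hypothetical helper computing the weight of the circular permutations that begin with a given symbol.

```python
from collections import Counter

def start_weight(grammar, sym):
    """G(sym S*): total weight of the circular permutations beginning with sym.

    With one canonical representative per expression, each occurrence of sym
    in an expression accounts for exactly one such permutation.
    """
    return sum(weight * expr.count(sym) for expr, weight in grammar.items())

text = Counter({
    (("[", 0), "This", "is", "my", "friend", "Peter", "."): 1,
    (("[", 0), "This", "is", "my", "neighbour", "John", "."): 1,
})
# A grammar obtained from the text by two splits using the same label 1.
grammar = Counter({
    (("[", 0), "This", "is", ("]", 1), "Peter", "."): 1,
    (("[", 0), "This", "is", ("]", 1), "John", "."): 1,
    (("[", 1), "my", "friend"): 1,
    (("[", 1), "my", "neighbour"): 1,
})

words = {s for expr in text for s in expr if isinstance(s, str)}
balanced = start_weight(grammar, ("[", 1)) == start_weight(grammar, ("]", 1))
same_frequencies = all(
    start_weight(grammar, w) == start_weight(text, w) for w in words)
same_sentence_count = (
    start_weight(grammar, ("[", 0)) == start_weight(text, ("[", 0)))
print(balanced, same_frequencies, same_sentence_count)
```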

This lemma is important, because we will subsequently impose restrictions 
on the splitting rule based on word frequencies. Our choice to define a new type 
of grammar as a positive measure on symbol sequences was made to keep track 
of word frequencies throughout the construction. 

Let us now describe the reverse of a splitting transformation, that we will call 
a merge transformation. This transformation will be central in generating new 
texts from a toric grammar, by merging the syntagms into bigger ones, ending 
with a full sentence. 
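
Definition 4.3 (Merge rule)

For any toric grammar $\mathcal{G} \in \mathfrak{G}$ we consider the following
set of allowed merge transformations
$$
\alpha(\mathcal{G}) = \bigl\{ \mathcal{G}' \in \mathfrak{G} : \mathcal{G}' = \mathcal{G} \ominus a]_i \ominus [_i b \oplus ab \bigr\}.
$$
Let us remark that in this definition, necessarily
$i \in \mathbb{N} \setminus \{0\}$, $a]_i,\ [_i b,\ ab \in \mathcal{E}$ and
$a]_i \oplus [_i b \le \mathcal{G}$.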
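
The set of allowed merges can be enumerated directly. This is a sketch under assumptions made for illustration: a grammar is a `Counter` from canonical expressions (opening bracket first) to integer weights, brackets are pairs such as `("]", 1)`, and `merges` is a hypothetical name.

```python
from collections import Counter

def merges(grammar):
    """List the grammars of alpha(G), for G given by canonical expressions."""
    found = {}
    for outer, w_outer in grammar.items():
        for pos, sym in enumerate(outer):
            if not (isinstance(sym, tuple) and sym[0] == "]"):
                continue
            for inner in grammar:
                if inner[0] != ("[", sym[1]):
                    continue
                if inner == outer and w_outer < 2:
                    continue      # a]_i (+) [_i b <= G needs two copies here
                merged = outer[:pos] + inner[1:] + outer[pos + 1:]
                result = Counter(grammar)
                result[outer] -= 1
                result[inner] -= 1
                result[merged] += 1
                result = +result  # drop expressions whose weight fell to zero
                found[frozenset(result.items())] = result
    return list(found.values())

g1 = Counter({
    (("[", 0), "This", "is", ("]", 1), "Peter", "."): 1,
    (("[", 0), "This", "is", ("]", 1), "John", "."): 1,
    (("[", 1), "my", "friend"): 1,
    (("[", 1), "my", "neighbour"): 1,
})
options = merges(g1)
print(len(options))
```

Each of the two global expressions can receive either local expression, which gives four distinct grammars after one merge.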

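The merge transformation is indeed the reverse of the split, in the sense that:

Lemma 4.2

For any $\mathcal{G}, \mathcal{G}' \in \beta^*(\mathcal{T})$,
$\mathcal{G}' \in \beta(\mathcal{G})$ if, and only if there is
$f \in \mathfrak{F}$ such that $f(\mathcal{G}) \in \alpha(\mathcal{G}')$.

Proof. Let us suppose that
$\mathcal{G}' = f(\mathcal{G} \oplus a]_i \oplus [_i b \ominus ab)$ is in
$\beta(\mathcal{G})$. Then
$\mathcal{G}' = f(\mathcal{G}) \oplus f(a]_i) \oplus f([_i b) \ominus f(ab)$, so
that $f(a]_i),\ f([_i b) \in \operatorname{supp}(\mathcal{G}')$,
$f(ab) \in \operatorname{supp}(f(\mathcal{G}))$, and consequently $f(a]_i)$,
$f([_i b)$ and $f(ab) \in \mathcal{E}$. Moreover
$f(\mathcal{G}) = \mathcal{G}' \oplus f(a) f(b) \ominus f(a) ]_{f(i)} \ominus [_{f(i)} f(b)$,
so that $f(\mathcal{G}) \in \alpha(\mathcal{G}')$.

On the other hand, if for some $f \in \mathfrak{F}$,
$f(\mathcal{G}) \in \alpha(\mathcal{G}')$, then
$f(\mathcal{G}) = \mathcal{G}' \oplus ab \ominus a]_i \ominus [_i b$.
Since $ab \in \operatorname{supp}(f(\mathcal{G}))$, there is
$e \in \mathcal{E}$ such that $f(e) = ab$. But this implies that there are
$c, d \in S^+$ such that $a = f(c)$ and $b = f(d)$. We can then if needed modify
$f$ outside
$\{ j \in \mathbb{N} : [_j S^* \cap \operatorname{supp}(\mathcal{G}) \neq \emptyset \}$,
to make sure that $i \in f(\mathbb{N})$. Let $f(j) = i$. We now get that
$f(\mathcal{G}) = \mathcal{G}' \oplus f(c) f(d) \ominus f(c) ]_{f(j)} \ominus [_{f(j)} f(d)$,
so that
$\mathcal{G}' = f(\mathcal{G} \oplus c]_j \oplus [_j d \ominus cd)$, proving
that $\mathcal{G}' \in \beta(\mathcal{G})$. □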

Another useful property of the merge rule is given by the following lemma:

Lemma 4.3

For any $f \in \mathfrak{F}$ and any $\mathcal{G} \in \mathfrak{G}$,
$f(\alpha(\mathcal{G})) \subset \alpha(f(\mathcal{G}))$.

Proof. Indeed, any $\mathcal{G}' \in f(\alpha(\mathcal{G}))$ is of the form
$$
\mathcal{G}' = f(\mathcal{G}) \oplus f(a) f(b) \ominus f(a) ]_{f(i)} \ominus [_{f(i)} f(b) \in \alpha(f(\mathcal{G})). \qquad \square
$$

Unfortunately, repeating the merge transformation will not provide a text in
all circumstances. Indeed, we can end up with some expressions of the type
$[_i\, a \,]_i\, b$. However, since an expression is allowed to contain only one
opening bracket, we are sure that
$[_0 \notin \operatorname{supp}([_i\, a \,]_i\, b)$.

To continue the discussion, we will switch to a random context, where split 
and merge transformations are performed according to some probability mea- 
sure. 

5. Random split and merge processes 

The grammars we described so far are obtained using splitting rules. Texts can 
be reconstructed using merge transformations. The splitting rules as well as the 
merge rules allow for multiple choices at each step. We will account for this by 
introducing random processes where these choices are made at random. 

We will describe two types of random grammar transformations. Each of 
these will appear as a finite length Markov chain, where the length of the chain
is given by a uniformly bounded stopping time. 

— The learning process (or splitting process) will start with a text and build 
a grammar through iterated splits; 

— the production process will start with a grammar and produce a text 
through iterated merge operations. 

These two types of processes may be combined into a split and merge process, 
going back and forth between texts and toric grammars. 

Let us give more formal definitions. Learning and parsing processes will be 
some special kinds of splitting processes, to be defined hereafter. 

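Definition 5.1 (Splitting process)

Given some restricted splitting rule
$\beta_r : \mathfrak{G} \to 2^{\mathfrak{G}}$ from the set of grammars to the
set of subsets of $\mathfrak{G}$, such that for any
$\mathcal{G} \in \mathfrak{G}$, $\beta_r(\mathcal{G}) \subset \beta(\mathcal{G})$,
a splitting process is a time homogeneous stopped Markov chain
$S_t$, $0 \le t \le \tau$, defined on $\mathfrak{G}$ such that
$$
\begin{aligned}
\tau &= \inf \{ t \in \mathbb{N} : \beta_r(S_t) = \emptyset \}, \\
\mathbb{P}(S_t = \mathcal{G}' \mid S_{t-1} = \mathcal{G}) > 0 &\iff \mathcal{G}' \in \beta_r(\mathcal{G}).
\end{aligned}
$$

Definition 5.2 (Production process)

A production process is a time homogeneous stopped Markov chain
$P_t$, $0 \le t \le \sigma$, defined on $\mathfrak{G}$ such that
$$
\sigma = \inf \{ t \in \mathbb{N} : \alpha(P_t) = \emptyset \},
$$
and
$$
\mathbb{P}(P_t = \mathcal{G}' \mid P_{t-1} = \mathcal{G}) > 0 \iff \mathcal{G}' \in \alpha(\mathcal{G}).
$$

Definition 5.3 (Split and Merge process)

Given a splitting process $S_t$, $t \in \mathbb{N}$, and a production process
$P_t$, $t \in \mathbb{N}$, a split and merge process is a Markov chain
$G_t \in \mathfrak{G}$, $t \in \mathbb{N}$, with transitions
$$
\begin{aligned}
\mathbb{P}(G_{2t+1} = \mathcal{G}' \mid G_{2t} = \mathcal{G}) &= \mathbb{P}(S_\tau = \mathcal{G}' \mid S_0 = \mathcal{G}), & t &\in \mathbb{N}, \\
\mathbb{P}(G_{2t} = \mathcal{G}' \mid G_{2t-1} = \mathcal{G}) &= \mathbb{P}(P_\sigma = \mathcal{G}' \mid P_0 = \mathcal{G},\ P_\sigma \in \mathfrak{T}), & t &\in \mathbb{N} \setminus \{0\},
\end{aligned}
$$
whose initial distribution is a probability measure on texts, so that almost
surely $G_0 \in \mathfrak{T}$.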

Let us remark that we have to impose the condition that
$P_\sigma \in \mathfrak{T}$, because the production process does not produce a
true text with probability one. On the other hand it can yield back $G_{2t-2}$
with positive probability when started at $G_{2t-1}$, as will be proved later
on. Therefore
$\mathbb{P}(P_\sigma \in \mathfrak{T} \mid P_0 = \mathcal{G}) > 0$ for any
$\mathcal{G}$ such that $\mathbb{P}(G_{2t-1} = \mathcal{G}) > 0$. One way to
simulate $\mathbb{P}_{G_{2t} \mid G_{2t-1}}$ is to use a rejection method,
simulating repeatedly from the production process until a true text is
produced. In the experiments we made,
$\mathbb{P}(P_\sigma \in \mathfrak{T} \mid P_0 = \mathcal{G})$ was close to one
and rejection a rare event.
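
The rejection method can be sketched as follows. Several choices here are assumptions made for illustration and are not prescribed by the definitions: merges are drawn uniformly among the allowed ones (the definition only requires positive probabilities), a grammar is a `Counter` from canonical expressions to weights, and all function names are hypothetical.

```python
import random
from collections import Counter

def merges(grammar):
    """Grammars reachable by one merge a]_i + [_i b -> ab (duplicates kept)."""
    out = []
    for outer, w_outer in grammar.items():
        for pos, sym in enumerate(outer):
            if not (isinstance(sym, tuple) and sym[0] == "]"):
                continue
            for inner in grammar:
                if inner[0] != ("[", sym[1]) or (inner == outer and w_outer < 2):
                    continue
                result = Counter(grammar)
                result[outer] -= 1
                result[inner] -= 1
                result[outer[:pos] + inner[1:] + outer[pos + 1:]] += 1
                out.append(+result)   # unary plus drops zero weights
    return out

def is_text(grammar):
    """A text uses only the start symbol [_0 and words."""
    return all(e[0] == ("[", 0) and all(isinstance(s, str) for s in e[1:])
               for e in grammar)

def produce(grammar, rng):
    """Merge at random until no merge is allowed (the stopping time sigma)."""
    while True:
        options = merges(grammar)
        if not options:
            return grammar
        grammar = rng.choice(options)

def sample_text(grammar, rng, max_tries=1000):
    """Rejection method: rerun the production process until it yields a text."""
    for _ in range(max_tries):
        candidate = produce(grammar, rng)
        if is_text(candidate):
            return candidate
    raise RuntimeError("no text produced within max_tries")

# Toy grammar where production can get stuck: merging "u" with "w" first
# leaves the expression [_1 v ]_1 alone, which cannot be merged any further.
grammar = Counter({
    (("[", 0), "u", ("]", 1)): 1,
    (("[", 1), "v", ("]", 1)): 1,
    (("[", 1), "w"): 1,
})
text = sample_text(grammar, random.Random(0))
print(text)
```

In this toy grammar the only text that can be produced is the single sentence "u v w", so the result does not depend on the seed; only the number of rejected runs does.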

Proposition 5.1

Let $S_t$, $P_t$ and $G_t$ be a splitting process, a production process and the
corresponding split and merge process, starting from
$G_0 = \mathcal{T} \in \mathfrak{T}$. For any $\mathcal{G} \in \mathfrak{G}$,
any $\mathcal{T}' \in \mathfrak{T}$, such that
$\sum_{t \in \mathbb{N}} \mathbb{P}(G_{2t+1} = \mathcal{G}) > 0$ and
$\sum_{t \in \mathbb{N}} \mathbb{P}(G_{2t} = \mathcal{T}') > 0$,
$$
\begin{aligned}
\mathbb{P}\bigl( \tau \le 2 [\mathcal{T}(D S^*) - \mathcal{T}([_0 S^*)] \bigm| S_0 = \mathcal{T}' \bigr) &= 1, & (5.1) \\
\mathbb{P}\bigl( \sigma \le 2 [\mathcal{T}(D S^*) - \mathcal{T}([_0 S^*)] \bigm| P_0 = \mathcal{G} \bigr) &= 1. & (5.2)
\end{aligned}
$$
In other words, the length of all the splitting and production processes
involved in the split and merge process have a uniform bound, given by twice the
difference between the number of words and the number of sentences in the
original text.

Proof. This proof is a bit lengthy and is based on some invariants in the split
and merge operations. It has been put off to appendix A.1 on page 29. □

Proposition 5.2

If $G_t$ is a split and merge process starting almost surely from the text
$G_0 = \mathcal{T} \in \mathfrak{T}$, there is a finite subset
$\mathfrak{G}_{\mathcal{T}}$ of toric grammars such that with probability equal
to one there is for each time $t$ a grammar $G'_t$ isomorphic to $G_t$ such that
$G'_t \in \mathfrak{G}_{\mathcal{T}}$. Thus, after identification of isomorphic
grammars, we can analyze the split and merge process as a finite state Markov
chain, since the reachable set from any starting point is finite. We should
however keep in mind that the finite state space $\mathfrak{G}_{\mathcal{T}}$
depends on the initial state $\mathcal{T}$, so the state space is still
infinite, although any trajectory will almost surely stay in a finite subset of
reachable states.

Proof. Let us assume that the labels of $\mathcal{G}$ are taken from
$[0, W_\ell(\mathcal{G})]$, meaning that $\mathcal{G}([_i S^*) = 0$ for
$i > W_\ell(\mathcal{G})$. This can be achieved, up to grammar isomorphisms, by
applying to $\mathcal{G}$ a suitable label map.
Let us define the set of canonical expressions
$$
\mathcal{E}_c = \mathcal{E} \cap \Bigl( \bigcup_{i \in \mathbb{N}} [_i S^* \Bigr),
$$
and the canonical decomposition of $\mathcal{G}$
$$
\mathcal{G} = \sum_{e \in \mathcal{E}_c} \mathcal{G}(e) \otimes e.
$$
We see that $\mathcal{G}$ can be described by the concatenation of the canonical
expressions, each repeated a number of times equal to its weight, to form a
sequence of symbols of length $W_s(\mathcal{G})$. From the proof of the previous
proposition, we know that
$$
W_s(\mathcal{G}) \le M = 5 W_w(\mathcal{T}) - 3 W_e(\mathcal{T}) = 5 \mathcal{T}(D S^*) - 3 \mathcal{T}([_0 S^*).
$$
We can represent $\mathcal{G}$ by a sequence of exactly $M$ symbols by padding
with trailing $[_0$ symbols the representation described above. Let us give an
example:
$$
\mathcal{G} = 2 \otimes [_0\, w_1 \,]_1\, w_2 \oplus [_1\, w_3 \oplus [_1\, w_4
$$
can be coded as
$$
[_0\, w_1 \,]_1\, w_2\ [_0\, w_1 \,]_1\, w_2\ [_1\, w_3\ [_1\, w_4\ [_0\ [_0\ [_0
$$
in the case when $M = 15$. Let us consider the set of symbols
$$
S_{\mathcal{T}} = D \cup \bigl\{ [_0, [_i, ]_i,\ 0 < i \le 2 [\mathcal{T}(D S^*) - \mathcal{T}([_0 S^*)] \bigr\}.
$$
Since $\mathcal{G}$ uses only those symbols, we see from the proposed coding of
$\mathcal{G}$ that it can take at most
$$
|S_{\mathcal{T}}|^M
$$
different values. Since
$$
|S_{\mathcal{T}}| = |D| + 1 + 4 [\mathcal{T}(D S^*) - \mathcal{T}([_0 S^*)],
$$
we have proved that
$$
|\mathfrak{G}_{\mathcal{T}}| \le \Bigl( |D| + 1 + 4 [\mathcal{T}(D S^*) - \mathcal{T}([_0 S^*)] \Bigr)^{5 \mathcal{T}(D S^*) - 3 \mathcal{T}([_0 S^*)}.
$$
Let us notice that this bound, while being finite, is very large. □
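
To make the coding argument and the size of the bound concrete, here is a short sketch. The function names and the `Counter` representation of a grammar (canonical expressions mapped to weights, brackets as pairs) are assumptions made for illustration.

```python
from collections import Counter

def encode(grammar, m):
    """Code a grammar as a sequence of exactly m symbols, padded with [_0."""
    symbols = []
    for expr, weight in grammar.items():
        symbols.extend(expr * weight)   # repeat each expression `weight` times
    if len(symbols) > m:
        raise ValueError("m is too small for this grammar")
    return symbols + [("[", 0)] * (m - len(symbols))

def state_space_bound(n_words, n_sentences, dict_size):
    """(|D| + 1 + 4 [T(DS*) - T([_0 S*)]) ** (5 T(DS*) - 3 T([_0 S*))."""
    m = 5 * n_words - 3 * n_sentences
    n_symbols = dict_size + 1 + 4 * (n_words - n_sentences)
    return n_symbols ** m

# The example of the proof, with M = 15.
g = Counter({
    (("[", 0), "w1", ("]", 1), "w2"): 2,
    (("[", 1), "w3"): 1,
    (("[", 1), "w4"): 1,
})
code = encode(g, 15)

# The one sentence text "This is my friend Peter ." has 6 words (the final
# dot counts as a word), 1 sentence, and a dictionary of 6 words.
bound = state_space_bound(n_words=6, n_sentences=1, dict_size=6)
print(len(code), bound)
```

Even for this six-word text the bound is 27 to the power 27, which illustrates how loose the estimate is compared with the handful of grammars actually reachable.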

6. Splitting rules and label identification

In the previous section, we introduced some class of random processes, and 
studied some of their general properties. In this section, we are going to describe 
some more specific schemes and go further in the description of split and merge 
processes that can learn toric grammars in a satisfactory way. 

The choice of splitting rules and label identification rules has a decisive influ- 
ence on the way syntactic categories and syntactic rules are learnt by the split 
and merge process. While it is necessary as a starting point to consider rules 
learnt from the text to be parsed itself, it will also be fruitful to consider the 
case when a previously learnt grammar $\mathcal{R} \in \mathfrak{G}$ can be used to govern the splits.

To make things easier to grasp, let us explain on some example the basics
of syntactic generalization by label identification. Let us start with the simple
text with two sentences.

$$G_0 = \mathcal{T} = [_0\ \text{This is my friend Peter .} \;\oplus\; [_0\ \text{This is my neighbour John .}$$

If we split "my friend" and "my neighbour" in the two sentences using the same
label, we will form after two splits the grammar

$$G_1 = [_0\ \text{This is}\ ]_1\ \text{Peter .} \;\oplus\; [_0\ \text{This is}\ ]_1\ \text{John .} \;\oplus\; [_1\ \text{my friend} \;\oplus\; [_1\ \text{my neighbour}$$

If no more splits are allowed and we therefore reached the stopping time of the
splitting process, so that $\tau = 2$, we can proceed to the production process, and
reach after two more steps the new text $G_2$ that can either be $G_2 = G_0$ or

$$G_2 = [_0\ \text{This is my neighbour Peter .} \;\oplus\; [_0\ \text{This is my friend John .}$$
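
The two-sentence example can be replayed with a few lines of code. The following is only a toy sketch: the list-of-words representation, the helper names `split` and `merge`, and the string `"]1"` used as a label are all illustrative assumptions, not the authors' implementation. It enumerates the texts reachable after the two splits and the two production steps.

```python
from itertools import permutations

# Hypothetical toy representation: a sentence is a list of words.
sentences = [
    "This is my friend Peter .".split(),
    "This is my neighbour John .".split(),
]
phrases = ["my friend".split(), "my neighbour".split()]


def split(sentence, phrase, label="]1"):
    """Replace the first occurrence of `phrase` in `sentence` by `label`."""
    n = len(phrase)
    for k in range(len(sentence) - n + 1):
        if sentence[k:k + n] == phrase:
            return sentence[:k] + [label] + sentence[k + n:]
    raise ValueError("phrase not found")


def merge(frame, phrase, label="]1"):
    """Substitute `phrase` for `label` in `frame` (one production step)."""
    k = frame.index(label)
    return frame[:k] + phrase + frame[k + 1:]


# Two splits: the frames carry the shared label, the phrases are the
# local expressions attached to that label.
frames = [split(s, p) for s, p in zip(sentences, phrases)]

# Production: each local expression is used exactly once (no replacement),
# so the possible output texts correspond to the permutations of the phrases.
possible_texts = {
    tuple(" ".join(merge(f, p)) for f, p in zip(frames, perm))
    for perm in permutations(phrases)
}
```

The set `possible_texts` contains the original text and the text in which "my friend" and "my neighbour" have been exchanged, which are the two outcomes described above.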

Now is a good time to remind the reader of the distinction made in section 3 
on page 6 about local and global expressions. 

Legitimate local expressions will be provided by the reference grammar 
whereas global expressions will be deduced from the text itself. This approach 
will be particularly efficient in the case when the set of local expressions is 
smaller than the set of global expressions. 

We will need two different kinds of split processes, one to learn the reference 
grammar from a text and the other one to perform the first part of the transitions 
of the communication Markov chain. 

These split processes may be viewed as performing some parsing of the text
they are applied to. Here, we do not use parsing as it is usually used, to
discover whether a sentence is correct or not; we use it instead to discover new
expressions.

We will start by defining the parsing rules to be used in the communication
chain. We will call them narrow parsing rules. We will then proceed to the
definition of a broad parsing rule suitable for learning the reference grammar
from a text.

Definition 6.1 

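Let us define the narrow parsing rule with reference grammar $\mathcal{R}$ as

$$\beta_n(\mathcal{G}, \mathcal{R}) = \bigl\{ \mathcal{G}' \in \mathfrak{G} : \mathcal{G}' = \mathcal{G} \oplus a\,]_i \oplus [_i\, b \ominus ab,\ ab \in \mathcal{E}_g,\ \mathcal{R}([_i\, b) > 0 \bigr\}, \quad \mathcal{G} \in \mathfrak{G}.$$

Let us remark that, due to the definition of the set of expressions $\mathcal{E}$ and of the set of grammars $\mathfrak{G}$,
the fact that $\mathcal{G}$ and $\mathcal{G}' \in \mathfrak{G}$ implies that $i \in \mathbb{N} \setminus \{0\}$ in this definition,
since necessarily $a\,]_i, [_i\, b \in \mathcal{E}$. It implies also that $[_0 \in \mathrm{supp}(a)$, a condition
equivalent to $ab \in \mathcal{E}_g$.

The narrow parsing rule depends on $\mathcal{R}$ only through $\mathrm{supp}(\mathcal{R}) \cap \mathcal{E}_l$.

Let us define the broad parsing rule as

$$\beta_b(\mathcal{G}, \mathcal{R}) = \bigl\{ \mathcal{G}' \in \mathfrak{G} : \mathcal{G}' = \mathcal{G} \oplus a\,]_i \oplus [_i\, b \ominus ab,\ \mathcal{R}(a\,]_i) + \mathcal{R}([_i\, b) > 0,\ \mathcal{R}(a S^*) \le \mu_1 \mathcal{R}([_0 S^*),\ \text{and } \mathcal{R}(b S^*) \le \mu_2 \mathcal{R}([_0 S^*) \bigr\}, \quad \mathcal{G}, \mathcal{R} \in \mathfrak{G},$$

where $\mu_1, \mu_2 \in \mathbb{R}_+$ are two positive real parameters.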

Since the reference grammar is under construction during broad parsing, we will
mainly use this rule with $\mathcal{R} = \mathcal{G}$, as will be explained later. The same learning
parameters $\mu_1$ and $\mu_2$ are present here and in the innovation rule to be described
next. They serve to split expressions into sufficiently infrequent halves, in order
to constrain the model.

Let us define now maximal sequences, a notion that will be needed to define 
learning rules. 

Definition 6.2 

Given some toric grammar $\mathcal{G}$, we will say that $a \in S^+$ is $\mathcal{G}$-maximal and write
$a \in \max(\mathcal{G})$ when

$$\mathcal{G}(a S^*) > \max \{ \mathcal{G}(a w S^*), \mathcal{G}(w a S^*),\ w \in S \}.$$

In other words, $a$ is a maximal subsequence among the subsequences with the
same weight in $\mathcal{G}$. Note that if $a$ is $\mathcal{G}$-maximal, usually $\mathcal{G}(a) = 0$ (meaning
that $a$ is not an expression of the grammar, but only a subexpression) and if the
grammar $\mathcal{G}$ has integer weights (which will be the case if it has been produced
by a split and merge process), then $\mathcal{G}(a S^*) \ge 2$.

Definition 6.3 (Innovation rule) 

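Using the notations $[_+ = \{ [_i, i \in \mathbb{N} \setminus \{0\} \}$ and $]_+ = \{ ]_i, i \in \mathbb{N} \setminus \{0\} \}$, let us
define the innovation rule with reference grammar $\mathcal{R}$ as

$$\beta_i(\mathcal{G}, \mathcal{R}) = \bigl\{ \mathcal{G}' \in \mathfrak{G} : \mathcal{G}' = \mathcal{G} \oplus a\,]_i \oplus [_i\, b \ominus ab,\ \mathcal{R}([_i S^*) = 0,\ \{a, b\} \cap \max(\mathcal{G}) \ne \emptyset,\ \mathcal{R}(a S^*) \le \mu_1 \mathcal{R}([_0 S^*),\ \text{and } \mathcal{R}(b S^*) \le \mu_2 \mathcal{R}([_0 S^*) \bigr\}.$$

Here again, the rule will be used while learning the reference grammar with
$\mathcal{R} = \mathcal{G}$.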

We will now introduce a label map that identifies the labels appearing in the 
same context. 

Definition 6.4 (Label identification through context) 

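Given some toric grammar $\mathcal{G} \in \mathfrak{G}$, let us consider the relation $C \subset (\mathbb{N} \setminus \{0\})^2$
defined as

$$C = \Bigl\{ (i, j) \in (\mathbb{N} \setminus \{0\})^2 : \sum_{a \in S^*} \mathcal{G}(a\,]_i)\, \mathcal{G}(a\,]_j) + \mathcal{G}([_i\, a)\, \mathcal{G}([_j\, a) > 0 \Bigr\}.$$

The smallest equivalence relation containing $C$ defines a partition of $\mathbb{N} \setminus \{0\}$ into
equivalence classes. Let $(A_k)_{k \in \mathbb{N} \setminus \{0\}}$ be an arbitrary indexing of this partition.
Each positive integer falls in a unique class of the partition, so that the relation
$i \in A_{\chi(i)}$ defines a label map $\chi_{\mathcal{G}} : \mathbb{N} \to \mathbb{N}$ in a non ambiguous way. The choice
of the indexing of the partition $(A_k)_{k \in \mathbb{N} \setminus \{0\}}$ does not matter, since two different
choices lead to two isomorphic label maps. When applying $\chi_{\mathcal{G}}$ to $\mathcal{G}$ itself we
will use the short notation $\chi(\mathcal{G}) = \chi_{\mathcal{G}}(\mathcal{G})$.

Let us consider the evolution of the number of labels used by $\mathcal{G}$:

$$L(\mathcal{G}) = \bigl| \{ i \in \mathbb{N} : \mathcal{G}(]_i S^*) > 0 \} \bigr|.$$

It is easy to see that $L(\chi(\mathcal{G})) \le L(\mathcal{G})$ and that $\chi(\mathcal{G}) \equiv \mathcal{G}$ if and only if
$L(\chi(\mathcal{G})) = L(\mathcal{G})$, where the symbol $\equiv$ means isomorphic. Accordingly there is
$k \in \mathbb{N}$ such that $\chi^{k+1}(\mathcal{G}) \equiv \chi^k(\mathcal{G})$, and we can take it to be the smallest integer
such that $L(\chi^{k+1}(\mathcal{G})) = L(\chi^k(\mathcal{G}))$. Consequently, $k$ is such that for any $n \ge k$,
$\chi^n(\mathcal{G}) \equiv \chi^k(\mathcal{G})$. We will define $\overline{\chi}(\mathcal{G}) = \chi^k(\mathcal{G})$, up to grammar isomorphisms
(so that $\overline{\chi}(\mathcal{G})$ belongs to $\mathfrak{G}/\!\equiv$ rather than to $\mathfrak{G}$ itself).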

A characterization in terms of more elementary label maps will be established
in Proposition A.6 on page 37. This characterization provides an algorithm to
compute this label identification in practice.
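
One pass of label identification through context amounts to computing the connected components of a "shares a context" relation, which a union-find structure does directly. The sketch below is an illustration under stated assumptions, not the algorithm of Proposition A.6: a grammar is assumed to be a dictionary from expressions to weights, an expression is a tuple starting with its opening label, labels are encoded as `("[", i)` or `("]", i)`, and the context of a closing label is read by rotating the circular expression. The function name `identify_labels` is hypothetical.

```python
from collections import defaultdict


def identify_labels(grammar):
    """One pass of label identification through context.

    `grammar` maps expressions (tuples of symbols) to positive weights.
    Symbols are words (str) or labels ("[", i) / ("]", i); each expression
    is stored starting with its opening label ("[", i).
    Returns a dict sending each positive label to its class representative.
    """
    parent = {}

    def find(i):
        parent.setdefault(i, i)
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    by_body = defaultdict(set)     # a -> labels i such that [_i a has weight
    by_context = defaultdict(set)  # a -> labels i such that a ]_i has weight
    for expr, weight in grammar.items():
        if weight <= 0:
            continue
        _, head = expr[0]
        if head > 0:
            by_body[expr[1:]].add(head)
        for k, sym in enumerate(expr):
            if isinstance(sym, tuple) and sym[0] == "]":
                # Rotate the circular expression so that ]_i comes last.
                by_context[expr[k + 1:] + expr[:k]].add(sym[1])

    # Labels sharing a body or a context fall in the same class.
    for labels in list(by_body.values()) + list(by_context.values()):
        first = min(labels)
        for i in labels:
            union(first, i)
    return {i: find(i) for i in parent}


toy = {
    (("[", 0), "This", "is", ("]", 1), "Peter", "."): 1,
    (("[", 0), "This", "is", ("]", 2), "Peter", "."): 1,
    (("[", 1), "my", "friend"): 1,
    (("[", 2), "my", "neighbour"): 1,
    (("[", 3), "my", "friend"): 1,
}
label_map = identify_labels(toy)
```

In this toy grammar, labels 1 and 2 appear in the same context and labels 1 and 3 have the same body, so all three end up in one class. Repeating the pass on the relabelled grammar until the number of labels stops decreasing would give the stabilized map described above; the choice of the smallest label as representative is only one possible indexing of the partition.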

We are now ready to define a learning rule. 

Definition 6.5 

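Let us define the learning rule

$$\beta_\ell(\mathcal{G}) = \begin{cases} \beta_i(\mathcal{G}, \mathcal{G}), & \text{when } \beta_b(\mathcal{G}, \mathcal{G}) = \emptyset, \\ \{ \overline{\chi}(\mathcal{G}'),\ \mathcal{G}' \in \beta_b(\mathcal{G}, \mathcal{G}) \}, & \text{otherwise.} \end{cases}$$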



We will define two kinds of splitting processes, based on two different choices
of the restricted splitting rule $\beta_r$.

Definition 6.6 (Learning process)

A learning process is a splitting process with restricted splitting rule

$$\beta_r(\mathcal{G}) = \beta_\ell(\mathcal{G}).$$

Definition 6.7 (Parsing process)

A parsing process with reference grammar $\mathcal{R} \in \mathfrak{G}$ is a splitting process with
restricted splitting rule

$$\beta_r(\mathcal{G}) = \beta_n(\mathcal{G}, \mathcal{R}).$$

Before we reach the aim of this paper and describe our statistical language 
model, we need to explore some of the properties of the production, learning 
and parsing processes introduced so far. 

7. Parsing and generalization 

Let us introduce some notations for the output of parsing, learning and produc- 
tion processes. 

Definition 7.1 

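Let $S_t$ be a parsing process, with reference grammar $\mathcal{R} \in \mathfrak{G}$. We will use the
following notation for the distribution of $S_\tau$.

$$\mathbb{G}_{\mathcal{T}, \mathcal{R}} = \mathbb{P}_{S_\tau \mid S_0 = \mathcal{T}}, \quad \mathcal{T} \in \mathfrak{T}.$$

We will also use a short notation, $\mathbb{T}_{\mathcal{G}}$, for the distribution of the output of a
production process started at the grammar $\mathcal{G}$.

Eventually, $\mathbb{G}_{\mathcal{T}}$ will be the probability distribution of the output of a learning
process $S_t$, according to the definition

$$\mathbb{G}_{\mathcal{T}} = \mathbb{P}_{S_\tau \mid S_0 = \mathcal{T}}, \quad \mathcal{T} \in \mathfrak{T}.$$

At this point we obviously may consider different notions of parsing that we
have to connect together. Namely, we would like to make a link between the
following statements:

— $\mathbb{T}_{\mathcal{G}}(\mathcal{T}) > 0$, the grammar $\mathcal{G}$ can produce the text $\mathcal{T}$;

— $\mathbb{G}_{\mathcal{T}, \mathcal{R}}(\mathcal{G}) > 0$, the text $\mathcal{T}$ can generate the grammar $\mathcal{G}$ when parsed with
the help of the grammar $\mathcal{R}$;

— $\mathbb{G}_{\mathcal{T}}(\mathcal{R}) > 0$, the grammar $\mathcal{R}$ can be learnt from the text $\mathcal{T}$.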

Lemma 7.1 

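The previous parse notions are related in the following way. For any $\mathcal{G} \in \mathfrak{G}$
and any $\mathcal{T} \in \mathfrak{T}$,

$$\mathbb{G}_{\mathcal{T}}(\mathcal{G}) > 0 \implies \mathbb{T}_{\mathcal{G}}(\mathcal{T}) > 0,$$
$$\mathbb{T}_{\mathcal{G}}(\mathcal{T}) > 0 \implies \mathbb{G}_{\mathcal{T}, \mathcal{G}}(\mathcal{G}) > 0.$$

Consequently, for any $\mathcal{R} \in \mathfrak{G}$ such that $(\mathrm{supp}(\mathcal{G}) \cap \mathcal{E}_l) \subset \mathrm{supp}(\mathcal{R})$, and any $\mathcal{T} \in \mathfrak{T}$,

$$\mathbb{T}_{\mathcal{G}}(\mathcal{T}) > 0 \iff \mathbb{G}_{\mathcal{T}, \mathcal{R}}(\mathcal{G}) > 0.$$

Proof. This is one of the core lemmas of this work. The proof is given in
appendix A.2 on page 31, on account of its length. □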

It has the following important implication. 
Proposition 7.2 

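Given a parsing process $S_t$ based on a reference grammar $\mathcal{R} \in \mathfrak{G}$ and a production
process $P_t$, the corresponding split and merge process $G_t$ is weakly reversible,
in the sense that for any $\mathcal{T} \in \mathfrak{T}$, any $\mathcal{G} \in \bigcup_{t \in \mathbb{N}} \mathrm{supp}(\mathbb{P}_{G_{2t+1}})$,

$$\mathbb{P}(G_1 = \mathcal{G} \mid G_0 = \mathcal{T}) > 0 \iff \mathbb{P}(G_2 = \mathcal{T} \mid G_1 = \mathcal{G}) > 0.$$

Consequently, for any $\mathcal{T}, \mathcal{T}' \in \mathfrak{T}$ and any $\mathcal{G}, \mathcal{G}' \in \bigcup_{t \in \mathbb{N}} \mathrm{supp}(\mathbb{P}_{G_{2t+1}})$,

$$\mathbb{P}(G_2 = \mathcal{T}' \mid G_0 = \mathcal{T}) > 0 \iff \mathbb{P}(G_2 = \mathcal{T} \mid G_0 = \mathcal{T}') > 0,$$
$$\mathbb{P}(G_3 = \mathcal{G}' \mid G_1 = \mathcal{G}) > 0 \iff \mathbb{P}(G_3 = \mathcal{G} \mid G_1 = \mathcal{G}') > 0.$$

In other words, the two processes $G_{2t}$ and $G_{2t+1}$ are weakly reversible time
homogeneous Markov chains. As we already proved that the set of reachable states
from any starting point is finite, it shows that they are recurrent Markov chains:
they partition their respective state spaces into positive recurrent communicating
classes.

Proof. Let us remark first that

$$\mathbb{P}(G_1 = \mathcal{G} \mid G_0 = \mathcal{T}) > 0 \iff \mathbb{G}_{\mathcal{T}, \mathcal{R}}(\mathcal{G}) > 0,$$
$$\mathbb{P}(G_2 = \mathcal{T} \mid G_1 = \mathcal{G}) > 0 \iff \mathbb{T}_{\mathcal{G}}(\mathcal{T}) > 0.$$

Moreover, since $\mathcal{G} \in \mathrm{supp}(\mathbb{P}_{G_{2t+1}})$ for some $t \in \mathbb{N}$, there is $\mathcal{T}' \in \mathfrak{T}$ such that
$\mathbb{G}_{\mathcal{T}', \mathcal{R}}(\mathcal{G}) > 0$, implying that $\mathrm{supp}(\mathcal{G}) \cap \mathcal{E}_l \subset \mathrm{supp}(\mathcal{R})$. This ends the proof
according to the last statement of the previous lemma. □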

8. Expectation of a random toric grammar 

In section 7 on the previous page, given some text $\mathcal{T} \in \mathfrak{T}$, we defined a
distribution on toric grammars $\mathbb{G}_{\mathcal{T}}$ that we would like to use to learn a grammar
from a text. The most obvious way to do this is to draw a toric grammar at
random according to the distribution $\mathbb{G}_{\mathcal{T}}$, and we already saw an algorithm,
described by a Markov chain and a stopping time, to do this.

The distribution $\mathbb{G}_{\mathcal{T}}$ will be spread in general on many grammars. This is a
kind of instability that we would like to avoid, if possible. A natural way to get
rid of this instability would be to simulate the expectation of $\mathbb{G}_{\mathcal{T}}$. To do this,
we are facing a problem: the usual definition of the expectation of $\mathbb{G}_{\mathcal{T}}$, that is

$$\int \mathcal{G} \, \mathrm{d}\mathbb{G}_{\mathcal{T}}(\mathcal{G}),$$

although well defined from a mathematical point of view, is a meaningless toric
grammar, due to the possible fluctuations of the label mapping. To get a meaningful
notion of expectation, we need to define in a meaningful way the sum of
two toric grammars. We will achieve this in two steps.

Let us introduce first the disjoint sum of two toric grammars. We will do this
with the help of two disjoint label maps. Let us define the even and odd label
maps $f_e$ and $f_o$ as

$$f_e(i) = 2i, \quad f_o(i) = \max\{0, 2i - 1\}, \quad i \in \mathbb{N}.$$
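
The role of the even and odd label maps can be illustrated in code. The sketch below assumes a dictionary representation of grammars (expressions as tuples, labels encoded as `("[", i)` or `("]", i)`), and it assumes that the disjoint sum relabels the first grammar with even labels, the second with odd labels, and adds the weights; the names `relabel` and `disjoint_sum` are hypothetical and this reading of the disjoint sum is an assumption of the sketch.

```python
def f_even(i):
    """Even label map: 0 stays 0, positive labels become even."""
    return 2 * i


def f_odd(i):
    """Odd label map: 0 stays 0, positive labels become odd."""
    return max(0, 2 * i - 1)


def relabel(grammar, f):
    """Apply the label map f to every label of every expression."""
    def map_symbol(sym):
        if isinstance(sym, tuple):   # a label ("[", i) or ("]", i)
            kind, i = sym
            return (kind, f(i))
        return sym                   # a word, left unchanged

    out = {}
    for expr, weight in grammar.items():
        key = tuple(map_symbol(s) for s in expr)
        out[key] = out.get(key, 0) + weight
    return out


def disjoint_sum(g1, g2):
    """Assumed disjoint sum: even labels for g1, odd labels for g2."""
    total = relabel(g1, f_even)
    for expr, weight in relabel(g2, f_odd).items():
        total[expr] = total.get(expr, 0) + weight
    return total
```

Both maps send the label 0 to itself, so sentences starting with the label 0 add up, while positive labels coming from the two grammars can never collide. Identifying labels afterwards is a separate step.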



Definition 8.1 

The disjoint sum of two toric grammars $\mathcal{G}, \mathcal{G}' \in \mathfrak{G}$ is defined as

Definition 8.2

Given a probability measure G € with finite support, we define the mean 

of G as 

^SfdG(Sf) = ES G(Sf) &j. 

Lemma 8.1 

If Gi is an i.i.d. sequence of random grammars distributed according to G, then 
almost surely 



lim 



Proof. The proof of this result is quite lengthy, and postponed till appendix A.3
on page 34. □



9. Language models 

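We are now ready to define the language model announced in the introduction.
Given a reference grammar $\mathcal{R}$ and the corresponding split and merge process
$(G_t)_{t \in \mathbb{N}}$ with reference grammar $\mathcal{R}$, we define the communication kernel $q_{\mathcal{R}}(\mathcal{T}, \mathcal{T}')$ on
$\mathfrak{T}^2$ as

$$q_{\mathcal{R}}(\mathcal{T}, \mathcal{T}') = \mathbb{P}(G_2 = \mathcal{T}' \mid G_0 = \mathcal{T}).$$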

According to Proposition 5.2 on page 13 and Proposition 7.2 on the facing page,
$q_{\mathcal{R}}$ has finite reachable sets and is weakly reversible, so that all texts $\mathcal{T} \in \mathfrak{T}$ are
positive recurrent states of the communication kernel $q_{\mathcal{R}}$.

Thus to each text $\mathcal{T} \in \mathfrak{T}$ corresponds a unique invariant text distribution
$\widehat{q}_{\mathcal{R}}(\mathcal{T}, \cdot)$, as explained in the introduction. As all states are positive recurrent,
$\widehat{q}_{\mathcal{R}}(\mathcal{T}, \cdot)$ is the unique invariant measure of $q_{\mathcal{R}}$ on the communicating class
containing $\mathcal{T}$. Moreover, from the ergodic theorem,



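$$\mathbb{P}\Bigl( \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \delta_{G_{2t}} = \widehat{q}_{\mathcal{R}}(\mathcal{T}, \cdot) \Bigm| G_0 = \mathcal{T} \Bigr) = 1,$$

showing that $\widehat{q}_{\mathcal{R}}(\mathcal{T}, \cdot)$ can be computed by an almost surely convergent
Monte-Carlo simulation. Eventually, from the invariant probability measure on texts
$\widehat{q}_{\mathcal{R}}(\mathcal{T}, \cdot)$ we deduce a probability measure on sentences $\widehat{Q}_{\mathcal{R}, \mathcal{T}}$ as explained in
the introduction, according to the formula

$$\widehat{Q}_{\mathcal{R}, \mathcal{T}} = \frac{1}{\mathcal{T}([_0 S^*)} \sum_{\mathcal{T}' \in \mathfrak{T}} \widehat{q}_{\mathcal{R}}(\mathcal{T}, \mathcal{T}') \, \mathcal{T}'.$$

(This is the same formula as in the introduction, taking into account the fact
that texts in the support of $\widehat{q}_{\mathcal{R}}(\mathcal{T}, \cdot)$ are non normalized empirical measures
with the same total mass equal to $\mathcal{T}([_0 S^*)$, the number of sentences in the text $\mathcal{T}$.)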
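
The Monte-Carlo computation mentioned above rests on ergodic averages: the fraction of time a recurrent chain spends in each state converges almost surely to the invariant measure of its communicating class. The sketch below illustrates this on a hypothetical three-state kernel standing in for the communication kernel (whose real states are texts); the names `ergodic_average`, `KERNEL` and `toy_step` are illustrative assumptions.

```python
import random
from collections import Counter


def ergodic_average(step, start, n_steps, rng):
    """Empirical occupation measure of a chain after n_steps transitions."""
    counts = Counter()
    state = start
    for _ in range(n_steps):
        state = step(state, rng)
        counts[state] += 1
    return {s: c / n_steps for s, c in counts.items()}


# Hypothetical weakly reversible kernel on three states.
KERNEL = {
    "A": [("A", 0.5), ("B", 0.5)],
    "B": [("A", 0.25), ("B", 0.5), ("C", 0.25)],
    "C": [("B", 0.5), ("C", 0.5)],
}


def toy_step(state, rng):
    """Draw the next state according to the toy kernel."""
    targets, weights = zip(*KERNEL[state])
    return rng.choices(targets, weights=weights)[0]


estimate = ergodic_average(toy_step, "A", 100_000, random.Random(0))
```

For this toy kernel the invariant measure is (1/4, 1/2, 1/4) by detailed balance, and the time averages in `estimate` approach these values as the number of steps grows.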

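To obtain a true language estimator, there remains to estimate $\mathcal{R}$ by some
estimator $\widehat{\mathcal{R}}$. We will do this as described in section 8 on page 18, taking for
$\widehat{\mathcal{R}}$ the mean of the distribution $\mathbb{G}_{\mathcal{T}}$.

Let us remark that, according to lemma 8.1 on the preceding page, $\widehat{\mathcal{R}}$ can
be computed from repeated simulations from the distribution $\mathbb{G}_{\mathcal{T}}$.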



10. Comparison with other models 

10.1. Comparison with Context Free Grammars 

Given a toric grammar $\mathcal{G} \in \beta^*(\mathfrak{T})$, we may consider the split and merge process
$G_t$ with reference grammar $\mathcal{G}$ starting at $G_1 = \mathcal{G}$ (so here we start at time 1
with an initial state that is a grammar, instead of starting at time 0 with an
initial state that is a text). Due to the weak reversibility of Proposition 7.2 on
page 18, $G_2$ almost surely falls in the same recurrent communicating class of $t \mapsto G_{2t}$,
and the unique invariant probability measure supported by this recurrent
communicating class defines a probability measure $\widehat{\mathbb{T}}_{\mathcal{G}}$ on texts, and therefore a
stochastic language model. This way of defining the language generated by the
grammar $\mathcal{G}$ can be compared to the usual definition of the language generated by
a Context Free Grammar. Indeed, the support of $\mathcal{G}$ is a Context Free Grammar,
so it is meaningful to consider the language generated by this grammar and
to compare it with the support of our stochastic language model.

Neither of these two sets of sentences is contained in the other one. In our
stochastic model, the number of times a rule can be used is bounded, so if the
recursive use of some rules is possible, the deterministic language will in this
sense be larger. On the other hand, the stochastic model uses both production
and parsing to build new sentences, whereas the deterministic model uses only
production rules. In this respect, the stochastic model may, at least in some
cases, define a much broader language, as we will show on the following example.

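Let us take as dictionary the set

$$D = \{+, =\} \cup [\![1, N]\!],$$

where $[\![1, N]\!] = \{ i \in \mathbb{N}, 1 \le i \le N \}$, and consider the toric grammar

$$\mathcal{G} = N^2 \otimes [_0\, ]_N = N \;\oplus\; \bigoplus_{i=1}^{N} N \otimes [_i\, i \;\oplus\; \bigoplus_{i=2}^{N} N(i-1) \otimes [_i\, ]_{i-1} + 1$$

and the text

$$\mathcal{T} = \bigoplus_{i=1}^{N} N \otimes [_0\, i \underbrace{+ 1 + \dots + 1}_{N - i \text{ times}} = N.$$

It is easy to check that $\mathbb{T}_{\mathcal{G}}(\mathcal{T}) > 0$ (so that $\mathcal{G} \in \beta^*(\mathcal{T})$), that indeed the
support of $\mathcal{T}$ is the language generated by $\mathrm{supp}(\mathcal{G})$, seen as a Context Free
Grammar, and that the stochastic language generated by $\mathcal{G}$ is able to produce
with positive probability a set of sentences

$$\mathrm{supp}_2(\widehat{\mathbb{T}}_{\mathcal{G}}) \overset{\text{def}}{=} \bigcup_{\mathcal{T}' \in \mathrm{supp}(\widehat{\mathbb{T}}_{\mathcal{G}})} \mathrm{supp}(\mathcal{T}'),$$

equal to

$$\mathrm{supp}_2(\widehat{\mathbb{T}}_{\mathcal{G}}) = \Bigl\{ [_0\, x_1 + \dots + x_i = x_{i+1} + \dots + x_j,\ 1 \le i < j \le 2N,\ x_k \in [\![1, N]\!],\ 1 \le k \le j,\ \sum_{k=1}^{i} x_k = \sum_{k=i+1}^{j} x_k = N \Bigr\}.$$

Here, the number of sentences produced by the underlying Context Free Grammar
is $|\mathrm{supp}(\mathcal{T})| = N$, whereas the number of sentences produced by our
stochastic language model is $|\mathrm{supp}_2(\widehat{\mathbb{T}}_{\mathcal{G}})| = 2^{2(N-1)}$. Thus, in this small example
based on arithmetic expressions (admittedly closer to a computing language
than it is to a natural language), our new definition of the generated language
induces a huge increase in the number of generated sentences.

Note that with usual Context Free Grammar notations, $\mathrm{supp}(\mathcal{G})$ would have
been described as

$$\boxed{0} \to \boxed{N} = N,$$
$$\boxed{i} \to i, \quad i = 1, \dots, N,$$
$$\boxed{i} \to \boxed{i-1} + 1, \quad i = 2, \dots, N,$$

where $\boxed{0}$ is the start symbol and $\boxed{i}$, $i = 1, \dots, N$, are other non terminal
symbols.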

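To count the number of elements in $\mathrm{supp}_2(\widehat{\mathbb{T}}_{\mathcal{G}})$, one can remark that the
number of ways $N$ can be written as $\sum_{k=1}^{i} x_k$ with an arbitrary number of
terms is also the number of increasing integer sequences $0 < s_1 < \dots < s_{i-1} < N$
of arbitrary length, which is also the number of subsets $\{s_1, \dots, s_{i-1}\}$ of
$\{1, \dots, N-1\}$, that is $2^{N-1}$. Since the two sides of the equality can be chosen
independently, this gives $(2^{N-1})^2 = 2^{2(N-1)}$ sentences.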
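
The counting argument can be checked by brute force for a small value of N. The sketch below enumerates the ordered sums (compositions) of N and builds every statement whose two sides are such sums; the function name `compositions` and the choice N = 4 are illustrative assumptions.

```python
def compositions(n):
    """All ways of writing n as an ordered sum of positive integers."""
    if n == 0:
        return [()]
    return [(first,) + rest
            for first in range(1, n + 1)
            for rest in compositions(n - first)]


N = 4
sides = compositions(N)

# Both sides of "x_1 + ... + x_i = x_{i+1} + ... + x_j" are compositions of N.
statements = {
    " + ".join(map(str, lhs)) + " = " + " + ".join(map(str, rhs))
    for lhs in sides
    for rhs in sides
}
```

For N = 4 there are 8 compositions and 64 statements, in agreement with the counts $2^{N-1}$ and $2^{2(N-1)}$, and every statement is a true arithmetic identity since both sides sum to N.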

Intuitively speaking, the underlying Context Free Grammar $\mathrm{supp}(\mathcal{G})$ is limited
to producing a small set of global expressions of the form $i + 1 + \dots + 1 = N$,
whereas the stochastic language model incorporates some crude logical reasoning
that is capable of deducing from them a large set of new global expressions.

Let us remark also that, when we start as here from a text made of true 
arithmetic statements, the language generated by our language model is also 
made of true arithmetic statements. This shows that our approach to language 
modeling is capable of some sort of logical reasoning. 



10.2. Comparison with Markov models 

The kind of reasoning illustrated in the previous section is related to the fact 
that we analyse global syntactic structures represented by the global expressions 
of our toric grammars. 

In order to give another point of comparison, we would like in this section 
to make a qualitative comparison with Markov models, that do not share this 
feature. To make a parallel between toric grammars and Markov models, we 
are going to show how a Markov model could be described in terms of toric 
grammars and label identification rules. 

To build a Markov model in our framework, we have to use a deterministic
splitting (or parsing) rule. This is because in a Markov model, conditional
probabilities are specified from left to right in a rigid data independent way. Let us
introduce the Markov splitting rule

$$\beta_m(\mathcal{G}) = \bigl\{ \mathcal{G}' \in \mathfrak{G} : \mathcal{G}' = \mathcal{G} \ominus [_0\, a\, w\, ]_i \oplus [_0\, a\, ]_j \oplus [_j\, w\, ]_i,\ i, j \in \mathbb{N} \setminus \{0\},\ a \in D^+,\ w \in D,\ \mathcal{G}([_j S^*) = 0 \bigr\}.$$

We will describe now label identification rules using concepts introduced in
appendix A.3 on page 34. Let us say that the pair of labels $p \in (\mathbb{N} \setminus \{0\})^2$ is
$\mathcal{G}$-Markov if there is $w \in D$ such that $\mathcal{G}(w\, ]_{p_1} S^*)\, \mathcal{G}(w\, ]_{p_2} S^*) > 0$. Let us say
that the sequence of pairs of labels $p_1, \dots, p_k$ is $\mathcal{G}$-Markov if $p_j$ is $\xi_{p_1, \dots, p_{j-1}}(\mathcal{G})$-Markov.
It can be proved as in the case of congruent sequences that if $\sigma$ is a
permutation and $p$ is $\mathcal{G}$-Markov, then $p \circ \sigma$ is also $\mathcal{G}$-Markov. It can also be
proved that if $p$ and $q$ are maximal $\mathcal{G}$-Markov sequences, then $\xi_p \equiv \xi_q$, and
therefore $\xi_p(\mathcal{G}) \equiv \xi_q(\mathcal{G})$. We will call $\xi_p(\mathcal{G}) \in \mathfrak{G}/\!\equiv$ the Markov closure of $\mathcal{G}$ and
use the notation $\xi_p(\mathcal{G}) \overset{\text{def}}{=} \mu(\mathcal{G})$, where $\mu(\mathcal{G})$ is the Markov pendent of $\overline{\chi}(\mathcal{G})$ in
the construction of toric grammars.

Let $S_t$, $0 \le t \le \tau$, be a splitting process based on the restricted splitting rule

$$\beta_r(\mathcal{G}) = \{ \mu(\mathcal{G}'),\ \mathcal{G}' \in \beta_m(\mathcal{G}) \}.$$
It is not very difficult to check that the support of $S_\tau$ is contained in a single
isomorphic class of grammars, so that, up to label remapping, the result of this
splitting process is deterministic. More specifically, starting from a text

$$\mathcal{T} = \bigoplus_{j=1}^{n} [_0\, w_1^j \dots w_{\ell(j)}^j,$$

where $w_i^j \in D \setminus \{.\}$, $1 \le i < \ell(j)$, $1 \le j \le n$, and $w_{\ell(j)}^j = .$ , $1 \le j \le n$, so that
all sentences end with a period, we obtain a grammar isomorphic to

$$\mathcal{G} = \bigoplus_{j=1}^{n} \Bigl( [_0\, w_1^j\, ]_{w_1^j} \;\oplus\; \bigoplus_{i=2}^{\ell(j)-1} [_{w_{i-1}^j}\, w_i^j\, ]_{w_i^j} \;\oplus\; [_{w_{\ell(j)-1}^j}\, w_{\ell(j)}^j \Bigr),$$

where we have used words as labels instead of integers, since in this model, due
to the label identification rule, labels are functions of words (namely $]_w$ is the
non terminal symbol following the word $w \in D$).

We can now define a Markov production mechanism, to replace the production
process. It is described as a Markov chain $X_i$, $i \in \mathbb{N}$, where $X_i \in D \cup \{\Lambda\}$,
where $\Lambda \notin D$ is a padding symbol used to embed finite sentences into infinite
sequences of symbols, all equal to $\Lambda$ for indices larger than the sentence length.
The distribution of the Markov chain $X_i$ is as follows. Its initial distribution is

$$\mathbb{P}(X_0 = w) = \frac{\mathcal{G}([_0\, w\, S^*)}{\mathcal{G}([_0 S^*)}, \quad w \in D,$$

and its transition probabilities are

$$\mathbb{P}(X_i = \Lambda \mid X_{i-1} = .) = 1,$$
$$\mathbb{P}(X_i = . \mid X_{i-1} = w) = \frac{\mathcal{G}([_w\, .)}{\mathcal{G}([_w S^*)}, \quad w \in D \setminus \{.\},$$
$$\mathbb{P}(X_i = w' \mid X_{i-1} = w) = \frac{\mathcal{G}([_w\, w'\, ]_{w'})}{\mathcal{G}([_w S^*)}, \quad w, w' \in D \setminus \{.\}.$$

Roughly speaking, the difference with the production process $P_t$ defined previously
is that in the production process the production rules are drawn at random
without replacement whereas here, the production rules are drawn with replacement.

It is easy to see that the initial distribution and transition probabilities of the
Markov chain $X_i$ are the empirical initial distribution and empirical transition
probabilities of the training text $\mathcal{T}$.
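
The empirical initial distribution and transition probabilities of a first order Markov model can be computed directly from a training text. The sketch below is a plain illustration of that estimation step, not of the toric grammar machinery; the function name `fit_markov` and the four-sentence text are illustrative assumptions.

```python
from collections import Counter, defaultdict


def fit_markov(sentences):
    """Empirical initial distribution and transition probabilities."""
    first = Counter(s[0] for s in sentences)
    pairs = defaultdict(Counter)
    for s in sentences:
        for prev, nxt in zip(s, s[1:]):
            pairs[prev][nxt] += 1
    n = len(sentences)
    initial = {w: c / n for w, c in first.items()}
    transition = {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in pairs.items()
    }
    return initial, transition


text = [s.split() for s in (
    "He is walking .",
    "Paul is walking .",
    "Paul is driving a car .",
    "I am running .",
)]
initial, transition = fit_markov(text)
```

Here `transition["is"]` only depends on the word "is" itself, whatever the subject of the sentence was, which is the local, backward only conditioning that distinguishes this model from toric grammars.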

In conclusion, to build a Markov model using the same framework as for toric
grammars, we had to modify two steps in a dramatic way:

— we had to change the splitting process, and replace the random splitting
process of toric grammars with a non random splitting process which
chains forward transitions in a linear way;

— we had to change in a dramatic way the label identification rule to replace
the forward and backward global condition of toric grammars with a
backward only local condition.

(The modification of the production process is less crucial and boils down to
drawing production rules with or without replacement.)

We hope that this discussion of Markov models will help the reader realize 
that our model proposal is indeed really different from the Markov model at 
sentence level. We could have extended easily the discussion to Markov models 
of higher order, or to more general context tree models. We let the reader figure 
out the details. All these more sophisticated models show the same differences 
from toric grammars: a more rigid splitting process and local backward label 
identification rules. 

11. A small experiment 

Let us end this study with a small example. Here we use a small text that is 
meant to mimic what could be found in a tutorial to learn English as a foreign 
language. We have added a more elaborate sentence at the end of the text to 
show its impact. More systematic experiments are yet to be carried out, although
the conception of this model was guided by experimental trial and error with
models starting with variable length Markov chains, before we tried global rules
leading to grammars.

This is the training text $\mathcal{T}$ (each line shows an expression, starting with its
weight):

1 [0 He is a clever guy . 

1 [0 He is doing some shopping . 

1 [0 He is laughing . 

1 [0 He is not interested in sports . 

1 [0 He is walking . 

1 [0 He likes to walk in the streets . 

1 [0 I am driving a car . 

1 [0 I am riding a horse too . 

1 [0 I am running . 

1 [0 Paul is crossing the street . 

1 [0 Paul is driving a car . 

1 [0 Paul is riding a horse . 

1 [0 Paul is walking . 

1 [0 Peter is walking . 

1 [0 While I was walking , I saw Paul crossing the street . 

And now, the new sentences produced by the model (that is, by the language model
of section 9, approximated on 50 iterations of the communication chain).

1 [0 Paul is driving a car too .

1 [0 Paul is doing some shopping .

1 [0 Paul is laughing . 

1 [0 Paul is riding a horse too . 

1 [0 Paul is running too . 

1 [0 Paul is running . 

1 [0 Paul is not interested in sports too . 

1 [0 Paul is not interested in sports . 

1 [0 Paul is a clever guy too . 

1 [0 Paul is a clever guy . 

1 [0 Paul is walking too . 

1 [0 Peter is driving a car too . 

1 [0 Peter is driving a car . 

1 [0 Peter is doing some shopping . 

1 [0 Peter is laughing . 

1 [0 Peter is riding a horse too . 

1 [0 Peter is riding a horse . 

1 [0 Peter is running too . 

1 [0 Peter is running . 

1 [0 Peter is not interested in sports . 

1 [0 Peter is a clever guy . 

1 [0 Peter is crossing the street . 

1 [0 He is driving a car too . 

1 [0 He is driving a car . 

1 [0 He is riding a horse too . 

1 [0 He is riding a horse . 

1 [0 He is running too . 

1 [0 He is running . 

1 [0 He is not interested in sports too . 

1 [0 He is crossing the street too . 

1 [0 He is crossing the street . 

1 [0 He is walking too . 

1 [0 I am driving a car too . 

1 [0 I am doing some shopping . 

1 [0 I am laughing too . 

1 [0 I am laughing . 

1 [0 I am riding a horse . 

1 [0 I am not interested in sports . 

1 [0 I am a clever guy . 

1 [0 I am crossing the street too . 

1 [0 I am crossing the street . 

1 [0 I am walking too . 

1 [0 I am walking . 

1 [0 While I was driving a car , I saw Paul doing some shopping too . 

1 [0 While I was driving a car , I saw Paul doing some shopping . 

1 [0 While I was driving a car , I saw Paul riding a horse . 

1 [0 While I was driving a car , I saw Paul crossing the street . 

1 [0 While I was driving a car , I saw Paul walking . 

1 [0 While I was driving a car , I saw Peter riding a horse . 

1 [0 While I was doing some shopping , I saw Paul riding a horse . 

1 [0 While I was doing some shopping , I saw Paul walking .

1 [0 While I was laughing too , I saw Peter crossing the street .

1 [0 While I was laughing , I saw Peter riding a horse .

1 [0 While I was riding a horse , I saw Paul driving a car too .

1 [0 While I was riding a horse , I saw Paul driving a car .

1 [0 While I was riding a horse , I saw Paul laughing .

1 [0 While I was riding a horse , I saw Paul running .

1 [0 While I was riding a horse , I saw Paul walking .

1 [0 While I was riding a horse , I saw Peter not interested in sports .

1 [0 While I was running , I saw Paul laughing .

1 [0 While I was running , I saw Paul not interested in sports .

1 [0 While I was running , I saw Paul a clever guy .

1 [0 While I was running , I saw Paul walking .

1 [0 While I was not interested in sports , I saw Paul driving a car .

1 [0 While I was not interested in sports , I saw Paul riding a horse .

1 [0 While I was a clever guy , I saw Paul running .

1 [0 While I was a clever guy , I saw Paul crossing the street .

1 [0 While I was a clever guy , I saw Paul walking .

1 [0 While I was crossing the street , I saw Paul riding a horse .

1 [0 While I was crossing the street , I saw Paul running .

1 [0 While I was crossing the street , I saw Paul crossing the street .

1 [0 While I was crossing the street , I saw Paul walking .

1 [0 While I was crossing the street , I saw Peter walking .

1 [0 While I was walking , I saw Paul driving a car .

1 [0 While I was walking , I saw Paul laughing .

1 [0 While I was walking , I saw Paul riding a horse .

1 [0 While I was walking , I saw Paul running .

1 [0 While I was walking , I saw Paul not interested in sports .

1 [0 While I was walking , I saw Paul crossing the street too .

1 [0 While I was walking , I saw Paul walking .

1 [0 While I was walking , I saw Peter not interested in sports .

1 [0 While I was walking , I saw Peter walking .



The reference grammar was learnt first, and was computed from f samples of 
Gy. (We did not normalize the weights, since we were interested in the support 
of the local expressions only.) 

10 [0 He likes to walk ]6 ]3 streets .

2 [0 ]1 ]8 clever guy .

2 [0 ]1 doing some shopping .

2 [0 ]1 laughing .

2 [0 ]1 not interested ]6 sports .

2 [0 ]1 riding ]8 horse .

2 [0 ]1 riding ]8 horse ]2 .

2 [0 ]1 running .

24 [0 ]7 am ]5 .

28 [0 Paul is ]5 .

40 [0 He is ]5 .

4 [0 ]1 crossing ]3 street .

4 [0 ]1 driving ]8 car .

5 [0 ]4 is ]5 .

6 [0 ]1 walking .

7 [0 Peter is ]5 .

8 [0 While ]7 was ]5 , ]7 saw ]4 ]5 .

10 [1 He is

2 [1 Peter is

2 [1 While ]7 was ]5 , ]7 saw ]4

6 [1 ]7 am

8 [1 Paul is

2 [2 too

30 [3 the

14 [4 Paul

1 [4 Peter

16 [5 crossing ]3 street

16 [5 driving ]8 car

16 [5 riding ]8 horse

34 [5 walking

8 [5 ]5 too

8 [5 ]8 clever guy

8 [5 doing some shopping

8 [5 laughing

8 [5 not interested ]6 sports

8 [5 running

20 [6 in

50 [7 I

50 [8 a

Although we have not yet made the software development effort required to
test large text corpora, we learnt a few interesting things from what we already
tried:

— As it is, the model requires the inclusion of a sufficient number of simple 
and redundant sentences to start generalizing. At this stage, we do not 
know whether this could be avoided by changing the learning rules. We 
made quite a few attempts in this direction. All of them resulted in the 
production of grammatical nonsense. Breaking the global constraints that 
are enforced by the model seems to have a dramatic effect on grammatical 
coherence. This could be a clue that these global conservation rules reflect 
some fundamental feature of the syntactic structure of natural languages. 
Including a bunch of "simple" sentences made of frequent words may be 
seen as introducing a pinch of supervision in the learning process. 

— The constraints on subexpression frequencies in the rules of Definition 6.1
(page 15) and Definition 6.3 were added to avoid some unwanted generalizations.
For instance here we took $\mu_1 \mathcal{R}([_0 S^*) = \mu_2 \mathcal{R}([_0 S^*) = 5$. If we had chosen
10 instead of 5, sentences of the kind

[0 While I was walking , I saw He crossing the street .

would have emerged, where the pronoun "He" is substituted for a noun in
the wrong place. We deliberately wrote the training text in such a way
that "He" is more frequent than any noun, since we expect that to be
true for any reasonably large corpus. Doing so, we were able to rule out
the wrong construction by lowering the frequency constraint to avoid the
unwanted substitution.

— Despite all the limitations of this small example, it shows that the model 
is able to find out non trivial new constructs, like 

[0 While I was laughing too, I saw Peter crossing the street. 

where it has discovered that "too" could be added to the subordinate
clause opening the sentence. We are quite pleased to see that such things
could be learnt along very general label identification rules, while all the
generalized sentences remain, if not all grammatically correct, at least all
grammatically plausible. Of course this judgement is purely subjective.
But since we have no mathematical or otherwise quantitative definition
of what natural languages are, we have to be content with a subjective
evaluation of models.

Studying how this learning model scales with large corpora is still a work to 
be done (it will require from us that we optimize our code so that it can run 
efficiently on large data sets). 

12. Conclusion 

We have built in this paper a new statistical framework for the syntactic analysis 
of natural languages. 

The main idea pervading our approach is that trying to estimate the distri- 
bution of an isolated random sentence is hopeless. Instead we propose to build 
a Markov chain on sets of sentences (called texts in this paper), with non trivial 
recurrent communicating classes and to define our language model as the invari- 
ant measures of this Markov chain on each of these recurrent communicating 
classes. At each step, the Markov chain recombines the set of sentences consti- 
tuting its current state, using cut and paste operations described by grammar 
rules. In this way we define the probability distribution of an isolated random 
sentence only in an indirect way. We replace the hard question of generating a
random sentence by the hopefully simpler one of recombining a set of sentences
in a way that keeps the desired distribution invariant.

The strong points of our approach are 

— a decisive departure from Markov models that are known to fail to catch 
the recursive structure of natural languages; 

— a new "communication model" concept that defines a Markov chain on 
texts and in parallel on toric grammars. This results in a new definition 
of the language generated by what appears as a weighted Context Free 
Grammar (called a toric grammar in the paper). This new perspective on 
language production may help to overcome the challenge of weak stimulus 
learning; 

— in this respect, the split and merge process with reference grammar $\mathcal{R}$
is the major mathematical achievement of the paper. It has non trivial
mathematical properties proving that it can be simulated using a bounded
number of operations at each step, and that the state space is divided into
recurrent communicating classes each including a finite number of states;

— preliminary experiments on small corpora are encouraging. They give the
(admittedly subjective) feeling that the model catches the structure
of the natural languages we tried (French and English). Some inflection
rules and other grammatical subtleties may be missed, but experimental
outputs nevertheless give us the impression that we are heading in the
right direction.

On the other hand, the model needs some refinements. In particular, our 
proposal to build a reference grammar from a text $\mathcal{T}$ through the grammar
expectation 



is clearly only a first foray into unknown territory. We hope to be able to elab- 
orate more on this part of our research program in the future. 

Appendix A: Proofs 

A.1. Bound on the length of splitting and production processes

Proof of Proposition 5.1 on page 12. Let us define the length of an expression
$e \in S^k \cap \mathcal{E}$ as $\ell(e) = k$. Let us introduce some remarkable weights associated
with a grammar $\mathcal{G} \in \beta^*(\mathcal{T})$.




W e (Sf) 



ee<? 



toG-D 



Let us define the set of canonical expressions as 




Using previously introduced notations, we can write the grammar as

$$\mathcal{G} = \sum_{e \in \mathcal{E}_c} \mathcal{G}(e) \otimes e.$$

We will call this the canonical decomposition of $\mathcal{G}$. The two weights $W_s(\mathcal{G})$ and
$W_e(\mathcal{G})$ are better understood in terms of this canonical decomposition. They
can be expressed as

$$W_s(\mathcal{G}) = \sum_{e \in \mathcal{E}_c} \mathcal{G}(e)\,\ell(e), \qquad W_e(\mathcal{G}) = \sum_{e \in \mathcal{E}_c} \mathcal{G}(e).$$

This shows that $W_s(\mathcal{G})$ counts the "number of symbols" in the canonical
decomposition of $\mathcal{G}$, whereas $W_e(\mathcal{G})$ counts the number of expressions (that is
$\mathcal{G}(\mathcal{E}_c)$, the weight put by the grammar on canonical expressions). We can also
see from the definitions that $W_l(\mathcal{G})$ counts the number of canonical expressions
starting with a positive (that is non terminal) label, that we will call for short
the number of labels, and that $W_w(\mathcal{G})$ counts the number of words.

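Since a split increases the number of canonical expressions by one, the number
of symbols in canonical expressions by two, the number of labels by one, and
keeps the number of words constant, whereas a merge decreases these quantities
in the same proportions, the following quantities are invariant in all the toric
grammars involved: for any $\mathcal{G} \in \mathfrak{G}$ such that $\sum_{t \in \mathbb{N}} \mathbb{P}(G_t = \mathcal{G}) > 0$,

$$W_s(\mathcal{G}) - 2 W_e(\mathcal{G}) = W_s(\mathcal{T}) - 2 W_e(\mathcal{T}),$$
$$W_e(\mathcal{G}) - W_l(\mathcal{G}) = W_e(\mathcal{T}) - W_l(\mathcal{T}) = W_e(\mathcal{T}),$$
$$W_w(\mathcal{G}) = W_w(\mathcal{T}).$$

Moreover, for the same reasons, for any $\mathcal{T}' \in \mathfrak{T}$ and $\mathcal{G} \in \mathfrak{G}$ such that
$\sum_{t \in \mathbb{N}} \mathbb{P}(G_{2t} = \mathcal{T}') > 0$ and $\sum_{t \in \mathbb{N}} \mathbb{P}(G_{2t+1} = \mathcal{G}) > 0$,

$$\mathbb{P}\bigl(\tau = W_l(S_\tau) \mid S_0 = \mathcal{T}'\bigr) = 1,$$
$$\mathbb{P}\bigl(\sigma = W_l(\mathcal{G}) \mid P_0 = \mathcal{G},\ P_\sigma \in \mathfrak{T}\bigr) = 1.$$

Thus, we will prove the lemma if we can bound $W_l(\mathcal{G})$ (or equivalently $W_l(S_\tau)$
when $S_0 = \mathcal{T}'$, since $S_\tau$ almost surely satisfies the conditions imposed on $\mathcal{G}$).
We can then remark that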

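$$\sum_{e \in \mathcal{E}_c} \mathcal{G}(e)\,\mathbb{1}[\ell(e) \geq 3] \leq \sum_{e \in \mathcal{E}_c} \mathcal{G}(e)\,[\ell(e) - 2] = W_s(\mathcal{G}) - 2 W_e(\mathcal{G}),$$

$$\sum_{e \in \mathcal{E}_c} \mathcal{G}(e)\,\mathbb{1}[\ell(e) = 2] = \sum_{e \in \mathcal{E}} \mathcal{G}(e)\,\mathbb{1}[\ell(e) = 2] \sum_{w \in D} \mathbb{1}(e \in w S^*) \leq \sum_{e \in \mathcal{E}} \mathcal{G}(e) \sum_{w \in D} \mathbb{1}(e \in w S^*) = W_w(\mathcal{G}),$$

because any canonical expression of length 2 is of the form $e = [_i\, w$, with $i \in \mathbb{N}$
and $w \in D$, so that for any $e \in \mathcal{E}_c$ of length 2,

$$\sum_{e' \in \mathfrak{e}(e)} \sum_{w \in D} \mathbb{1}(e' \in w S^*) = 1.$$

Thus

$$W_e(\mathcal{G}) \leq W_w(\mathcal{G}) + W_s(\mathcal{G}) - 2 W_e(\mathcal{G}),$$

and consequently we can bound $W_l(\mathcal{G})$ by the split and merge invariant bound

$$W_l(\mathcal{G}) \leq W_l(\mathcal{G}) - W_e(\mathcal{G}) + W_w(\mathcal{G}) + W_s(\mathcal{G}) - 2 W_e(\mathcal{G}).$$

This, added to the fact that $W_l(\mathcal{T}) = 0$ and $W_s(\mathcal{T}) = W_w(\mathcal{T}) + W_e(\mathcal{T})$, proves
that

$$W_l(\mathcal{G}) \leq 2\,[W_w(\mathcal{T}) - W_e(\mathcal{T})].$$

This ends the proof, since $W_w(\mathcal{T}) = \mathcal{T}(D S^*)$ and $W_e(\mathcal{T}) = \mathcal{T}([_0\, S^*)$. □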
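
The bookkeeping used in this proof can be checked on a toy case. The Python sketch below is only an illustration under stated assumptions, not the code used for the experiments: the tuple encoding of expressions and the names `weights`, `invariants` and `split` are hypothetical, and the admissibility conditions on splits are not enforced. It applies two splits to the text $[_0\, a b c d$ and checks the three invariants together with the final bound on the number of labels.

```python
from collections import Counter

# Hypothetical encoding: a word is a str, [_i is ("[", i), ]_i is ("]", i).
# A grammar is a Counter mapping canonical expressions (tuples starting
# with an opening label) to their integer weights.

def weights(g):
    """Return (W_s, W_e, W_l, W_w) read off the canonical decomposition."""
    w_s = sum(n * len(e) for e, n in g.items())        # number of symbols
    w_e = sum(g.values())                              # number of expressions
    w_l = sum(n for e, n in g.items() if e[0][1] > 0)  # positive opening labels
    w_w = sum(n * sum(isinstance(s, str) for s in e) for e, n in g.items())
    return w_s, w_e, w_l, w_w

def invariants(g):
    """The three quantities claimed to be preserved by splits and merges."""
    w_s, w_e, w_l, w_w = weights(g)
    return (w_s - 2 * w_e, w_e - w_l, w_w)

def split(g, e, start, stop, label):
    """Cut e[start:stop] out of one copy of e and replace it with ]_label.
    Admissibility conditions on splits are NOT checked in this sketch."""
    assert g[e] >= 1 and 1 <= start < stop <= len(e)
    h = Counter(g)
    h[e] -= 1
    if h[e] == 0:
        del h[e]
    h[e[:start] + (("]", label),) + e[stop:]] += 1
    h[(("[", label),) + e[start:stop]] += 1
    return h

text = Counter({(("[", 0), "a", "b", "c", "d"): 1})
g1 = split(text, (("[", 0), "a", "b", "c", "d"), 4, 5, 3)
g2 = split(g1, (("[", 0), "a", "b", "c", ("]", 3)), 3, 5, 2)

_, w_e_text, _, w_w_text = weights(text)
bound = 2 * (w_w_text - w_e_text)  # right-hand side of the final inequality
for g in (text, g1, g2):
    assert invariants(g) == invariants(text)
    assert weights(g)[2] <= bound
```

Here each split adds two symbols, one expression and one label, so the three differences stay fixed while the number of labels grows by one per split.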

A.2. Parsing Relations

Proof of lemma 7.1 on page 17. The implication

> > 

is less trivial than it may seem. Indeed we can reverse the path of the splitting
process $S_t$, be it a parsing or a learning process, to obtain a path followed with
positive probability by the production process, but reversing the production
process does not give a parsing process. Let us illustrate this difficulty on a
simple example. Consider

$$\mathcal{T} = 1 \otimes [_0\, abcd \quad \text{and} \quad \mathcal{G} = [_0\, a\, ]_1 \oplus [_1\, b\, ]_2 \oplus [_2\, c\, ]_3 \oplus [_3\, d.$$

The production path

$$\mathcal{G}, \quad [_0\, ab\, ]_2 \oplus [_2\, c\, ]_3 \oplus [_3\, d, \quad [_0\, ab\, ]_2 \oplus [_2\, cd, \quad \mathcal{T}$$

has positive probability. The reverse path may have a positive probability for
the learning process but not for the parsing process with reference $\mathcal{G}$, since
none of the expressions $[_0\, ab\, ]_2$ or $[_2\, cd$ belongs to the support of $\mathcal{G}$. To parse $\mathcal{T}$
according to $\mathcal{G}$, one can instead follow with positive probability such a path as

$$\mathcal{T}, \quad [_0\, abc\, ]_3 \oplus [_3\, d, \quad [_0\, ab\, ]_2 \oplus [_2\, c\, ]_3 \oplus [_3\, d, \quad \mathcal{G}.$$

To prove the lemma, we will have to show that it is always possible to find such
an alternative parsing path. This property is fundamental to our approach, since
it proves that the toric grammars we build can be used to parse the texts they
can produce.
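
The example can be replayed mechanically. The Python sketch below is an illustration, not the authors' implementation: the encoding and the helper names `op`, `cl` and `split` are hypothetical, and it assumes (as the end of this proof suggests) that a parsing step with reference grammar is admissible when the extracted piece belongs to the support of the reference. It shows that the first step of the reversed production path is rejected, while the alternative path reaches the reference grammar using only admissible pieces.

```python
from collections import Counter

def op(i):
    """Opening label [_i (hypothetical encoding)."""
    return ("[", i)

def cl(i):
    """Closing label ]_i (hypothetical encoding)."""
    return ("]", i)

def split(g, e, start, stop, label):
    """Return (new grammar, extracted piece), the piece being [_label e[start:stop]."""
    assert g[e] >= 1 and 1 <= start < stop <= len(e)
    h = Counter(g)
    h[e] -= 1
    if h[e] == 0:
        del h[e]
    piece = (op(label),) + e[start:stop]
    h[e[:start] + (cl(label),) + e[stop:]] += 1
    h[piece] += 1
    return h, piece

T = Counter({(op(0), "a", "b", "c", "d"): 1})
G = Counter({
    (op(0), "a", cl(1)): 1,
    (op(1), "b", cl(2)): 1,
    (op(2), "c", cl(3)): 1,
    (op(3), "d"): 1,
})

# First step of the reversed production path: T -> [_0 ab ]_2 + [_2 cd.
_, bad_piece = split(T, (op(0), "a", "b", "c", "d"), 3, 5, 2)
reversed_path_allowed = G[bad_piece] > 0  # [_2 cd is not in supp(G)

# Alternative parsing path: every extracted piece lies in supp(G).
steps = [
    ((op(0), "a", "b", "c", "d"), 4, 5, 3),    # extracts [_3 d
    ((op(0), "a", "b", "c", cl(3)), 3, 5, 2),  # extracts [_2 c ]_3
    ((op(0), "a", "b", cl(2)), 2, 4, 1),       # extracts [_1 b ]_2
]
state, pieces_ok = T, []
for e, start, stop, label in steps:
    state, piece = split(state, e, start, stop, label)
    pieces_ok.append(G[piece] > 0)
```

After the three steps, `state` coincides with the reference grammar, which is the behaviour the lemma asserts in general.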

Let us start with the easiest part of the proof. Assume that Qy,3i(&) > 0. 
This means that there is a path . . . , % such that % = & % % = and ^* € 
/3 n (@t-i,&)- Anyhow it is easy to check that 

so that the reverse path is followed with a positive probability by the production 
process. This means that T^(i^) > 0. 
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In the case of the learning process, if (Sf) > 0, there is a path Sf t , < t < k, 
such that £ ft(^t_i), % = ^ and % = Sf, consequently there is a label 
map ft & 5 such that f t {^t-\) £ Qt(^t)- We can then remark that 

A o • ■ ■ o / t (Sf t _i) £ a(/ fc o • • • o / t+ i(Sf t )), 

because as already proved before in lemma 4.3 on page 11, /(a(#)) C a(/($f)). 
Let us consider the path % — f k o ■ ■ ■ o fk-t+i (&k-t) ■ It begins at Sfo = % = ^ 
and ends at Sfj, = fk ° • ■ • ° /i (%) = According to the previous remark, this 
path is followed by the production process with positive probability, proving 
that Ty(&) > 0. 

Let us now come to the proof of the third implication of the lemma. For 
this let us assume now that T&(&) > 0. Consider a path . . . , % such that 
% = ^,---,^k — & and £ &(&t-i)- We are going to define some dec- 
orated path . . . ,&k with some added parentheses. Introduce a new set of 
symbols B = {(j , )j, i £ IN \ {0}} and assume that it is disjoint from the other 
symbols used so far, so that B n 5 = 0. Consider the set of toric grammars & 
based on the enlarged dictionary DUB, and the projection it : & — > defined 
with the help of the canonical decomposition of toric grammars as 

eG<f c eG<? c 

where <? c is the set of canonical expressions based on the enlarged dictionary DU 
B, and where 7r(e) is obtained by removing from the sequence of symbols e the 
symbols belonging to the decoration set B (that is the parentheses). 

Let us put Sf = & and define Sf t for t = 1, . . . , k byjnduction. We will 
check on the go that 7r(Sf t ) = It is obviously true for Sf , because % £ 6, 

so that 7r(%) = Sfo = That said, let us describe the construction of 5^, 

assuming that %-i is already defined, and satisfies 7r(^t-i) = S^t-i- Consider 
the sequence of symbols a and b £ S 1 * and the index z £ IN \ {0} such that 

&t = &t-i®abea}iG[i b. 

Since 7r(Sf t _i) = &t-i, an d since o]j © [j 6 ^t-i, there are a e S* and 6 £ S* 
such that 7r(a) = a, 7r(6) = 6, and a]i © [j 6 ^ (The choice of a and 6 may 

not be unique, in which case we can make any arbitrary choice). Let us define 

&t = ^t-i®a{ib)iQa}i 9 [J>- 
Since 7r(a(, = 7r(a&) = ab, 

tt(^) = Tr(Gt-i) © n(a{ib)i) © 7r(5];) © ?r([ib) - 9 t -i © a& 6 a] t © [< 6 = <S U 
where we have used the obvious fact that 7r is linear. 
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We are now going to^dcfine another mapping between grammars that allows 
to recover Sf from any % (obviously the decorations where added to keep track 
of ). Let us define ip : &' — >• & on the set of decorated grammars 6' which arc 
supported by expressions where the parentheses (j )j are matched (at the same 
level) by the formula 

where ip(e) is defined by the rules 

_ {ip([ia]jc) +i>(\jb), if e=[ia(jb)jC, with a, 6, c e 5* 
I ^(e) = 1 e, otherwise. 

It is easy to check that this definition is not ambiguous and that 

He) = V>'(e) 6 [iV'(a), 

(ia)iGsupp(e) 

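where $\psi'(e)$ is the expression obtained from $e$ by replacing all the sequences
between outer parentheses pairs $(_j\, d\, )_j$ by $]_j$. This may be easier to grasp on
an example:

$$\psi\bigl([_0\, a\, (_1\, b\, (_2\, c\, )_2\, d\, )_1\, e\, (_3\, f\, (_4\, g\, )_4\, )_3\, h\bigr) = [_0\, a\, ]_1\, e\, ]_3\, h \oplus [_1\, b\, ]_2\, d \oplus [_2\, c \oplus [_3\, f\, ]_4 \oplus [_4\, g.$$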
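
The recursive rule defining this map on a single decorated expression is short enough to be run. The following Python sketch is a hypothetical illustration (the tuple encoding and the function name `psi` are assumptions, and only the action on one expression is covered, not the linear extension to grammars). It replaces each matched pair of parentheses with label $j$ by a closing label $]_j$ and emits the enclosed content as a new expression starting with $[_j$.

```python
from collections import Counter

def psi(e):
    """Undecorate one expression; return the resulting sum as a Counter.

    Hypothetical encoding: words are str, labels are ("[", i) / ("]", i),
    decorations are ("(", i) / (")", i).
    """
    for start, s in enumerate(e):
        if isinstance(s, tuple) and s[0] == "(":
            # Find the closing parenthesis matching s at the same level.
            depth = 0
            for stop in range(start, len(e)):
                t = e[stop]
                if isinstance(t, tuple) and t[0] == "(":
                    depth += 1
                elif isinstance(t, tuple) and t[0] == ")":
                    depth -= 1
                    if depth == 0:
                        break
            else:
                raise ValueError("unmatched parenthesis")
            j = s[1]
            assert e[stop] == (")", j), "labels of a matched pair must agree"
            outer = e[:start] + (("]", j),) + e[stop + 1:]
            inner = (("[", j),) + e[start + 1:stop]
            return psi(outer) + psi(inner)
    return Counter({e: 1})  # no decoration left

decorated = (
    ("[", 0), "a", ("(", 1), "b", ("(", 2), "c", (")", 2), "d", (")", 1),
    "e", ("(", 3), "f", ("(", 4), "g", (")", 4), (")", 3), "h",
)
result = psi(decorated)
```

The sketch always expands the leftmost outer pair first; the text states that the definition is not ambiguous, so other expansion orders should give the same sum.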

It is^ easy to check by^induction that Sft G & ■ Let us check moreover that 
i>(y t ) = v. Indeed ip(%) = V>(Sf) = Sf and 

V(g*) = v(G t _i) ev(o(<&)i) ev(o]i) ev([J) =^(G t _i), 

since V is linear and i()(a(i 6)j) — ip(a]i) © ^([i 6). 

We are now going to define a continuation for the path (& t , ^ t < fc) that 
will bring us back to £f . 

We will maintain during our inductive construction two properties: 

1>(& t ) =Sf, 
and supp(^)nijc^, 

where is the set of local decorated expressions, so that 

§1 = {eel: [o^supp(e)}. 

We already proved that the first property is satisfied by £ffc- As 7r(Sffc) = % = 
supp(%) n^i = 0, so that the second condition is also satisfied. Let us assume 
that, for some t > k, & t -i has been defined and satisfies the two conditions 
above, and let us proceed to the construction of §P t . 
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As long as Sf t _i ^ &, (and this will be the case for t < 2k), find some canonical 
expression e G $c\$ c , such that Sf t _i(e) ^ 1. From our induction hypotheses, we 
see that necessarily [ G supp(e). Our continuation will be such that each such 
expression has matching parentheses with matching labels, and we will check 
this on the go while building it by induction. Among those matching pairs of 
parentheses, there is necessarily at least one inner pair. We can for instance 
choose the one starting with the last opening parenthesis (j of the sequence e. 
This choice makes it obvious that the subsequence of e enclosed between (j 
and )j contains no further parentheses. 

Since ip is linear and preserves positive measures, &Qip(e) — ^(Sft_i)0r/;(e) = 
^(^t_i9e) ^ 0, On the other hand, e has the form e = [ a(jb)jC, where tl>(b) = b 
(since is an inner pair of parentheses in e). As tp(e) = ip([ a]jC) +tp([jb) 
and tp([jb) = [jb, this shows that [j b ^ Sf, and therefore that ^(\jb) > 0. Let 
us now define 

&t = &t-i Gee [jb (B [oa]jC. 

Applying tp to Sf t , we see as previously that ip(@t) = 4>(&t-i) = & ■ As 5^ 
contains fc pairs of parentheses, and we consume one pair at cach_step t >Jc, we 
see that %fe contains no more parentheses, so that &2k G 25 and &2k = tyif&ik) — 
. Let us put now ^ t = 7r(^t) , for t = k + 1, . . . , 2k. We see that 

% =&t-ie [o abc 8 [o a ]jC 8 [j b, 

where [o abc, [o a ] 3 ;c G £ and ^(\jb) > 0, so that Sft G /3„(£f t _i, £f ), therefore 
% = 2T ,.. . ,^2k — & is a path of positive probability under the parsing pro- 
cess with reference Sf, leading from to in other words, > as 
required. □ 



A.3. Convergence to the expectation of a random toric grammar,
proof of lemma 8.1 on page 19

The proof of this result is based on the fact that the operation

$$(\mathcal{G}, \mathcal{G}') \mapsto \chi(\mathcal{G} \boxplus \mathcal{G}')$$

is associative.

Let us begin the proof by several definitions and lemmas. 

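For any grammar $\mathcal{G} \in \mathfrak{G}$ and any pair of indices $p = (p^1, p^2) \in (\mathbb{N} \setminus \{0\})^2$, we
will say that $p$ is $\mathcal{G}$-congruent when there is $a \in S^*$ such that
$\mathcal{G}(a\, ]_{p^1})\, \mathcal{G}(a\, ]_{p^2}) > 0$ or $\mathcal{G}([_{p^1}\, a)\, \mathcal{G}([_{p^2}\, a) > 0$.

Let us define the label map $\xi_p$ as

$$\xi_p(i) = \begin{cases} i, & \text{when } i \notin \{p^1, p^2\}, \\ \min\{p^1, p^2\}, & \text{when } i \in \{p^1, p^2\}. \end{cases}$$

For any sequence $p_1, \dots, p_k \in (\mathbb{N} \setminus \{0\})^{2k}$ of pairs of indices, let us define the
label map $\xi_{p_1, \dots, p_k}$ as

$$\xi_{p_1, \dots, p_k} = \xi_{\xi_{p_1, \dots, p_{k-1}}(p_k)} \circ \xi_{p_1, \dots, p_{k-1}},$$

where $f((i,j)) = (f(i), f(j))$, for any $(i,j) \in (\mathbb{N} \setminus \{0\})^2$.

Let us say that $(p_1, \dots, p_k) \in (\mathbb{N} \setminus \{0\})^{2k}$ is $\mathcal{G}$-congruent, if $\xi_{p_1, \dots, p_{j-1}}(p_j)$ is
$\xi_{p_1, \dots, p_{j-1}}(\mathcal{G})$-congruent for any $j \leq k$, and that it is maximal $\mathcal{G}$-congruent if it
is $\mathcal{G}$-congruent and any $\mathcal{G}$-congruent sequence of the form $(p_1, \dots, p_k, p_{k+1})$ is
such that

$$\xi_{p_1, \dots, p_k}(p_{k+1}^1) = \xi_{p_1, \dots, p_k}(p_{k+1}^2),$$

or equivalently such that $\xi_{p_1, \dots, p_{k+1}} = \xi_{p_1, \dots, p_k}$.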

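Lemma A.1

For any sequence $(p_1, \dots, p_\ell) \in (\mathbb{N} \setminus \{0\})^{2\ell}$, for any $k < \ell$,

$$\xi_{p_1, \dots, p_\ell} = \xi_{\xi_{p_1, \dots, p_k}(p_{k+1}, \dots, p_\ell)} \circ \xi_{p_1, \dots, p_k}.$$

Proof. By induction on $\ell$ for $k$ fixed. This is true from the definition for $\ell = k+1$.
Assuming we have established the lemma for $\ell - 1$, we can write

$$\xi_{p_1, \dots, p_\ell} = \xi_{\xi_{p_1, \dots, p_{\ell-1}}(p_\ell)} \circ \xi_{p_1, \dots, p_{\ell-1}}$$

$$= \xi_{\xi_{\xi_{p_1, \dots, p_k}(p_{k+1}, \dots, p_{\ell-1})} \circ\, \xi_{p_1, \dots, p_k}(p_\ell)} \circ \xi_{\xi_{p_1, \dots, p_k}(p_{k+1}, \dots, p_{\ell-1})} \circ \xi_{p_1, \dots, p_k}$$

$$= \xi_{\xi_{p_1, \dots, p_k}(p_{k+1}, \dots, p_\ell)} \circ \xi_{p_1, \dots, p_k}. \qquad \square$$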

Lemma A.2

For any permutation $\sigma$ of $\{1, \dots, k\}$,

$$\xi_{p_1, \dots, p_k} = \xi_{p_{\sigma(1)}, \dots, p_{\sigma(k)}}.$$

Proof. Let us consider the smallest equivalence relation containing the set 
{pi, . . . ,Pfe}. Let 7Ffc : M \ {0} — > ^ be the corresponding projection of each 
label to its component. Let us befine the label map irk by 7Tfc(0) = and 

7Tfe(i) = min7Tfe(i), i > 0. 

We are going to prove by induction on k that £ Pl ,..., Pk = ^k- Since -Kk is invariant 
by permutation of the sequence (pi, . . . ,Pk), this will prove the lemma. 
Let us remark now that € Pl ,..., Pk = ^k if and only if 

tpu..., Pk (i) = € P i,..., Pk U) Kk{i) = 7T fc (j), i,j > 0. 

So we are going to prove this equivalence. It is easy to see from the previous 
lemma that for any integer m = 1, . . . , k, 

Z Pl ,-,p k (Pm) = Zpu-, Pk (P m )- (A- 1 ) 
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Indeed, 

£pi,..., P fc — ££pi,...,p m (Pm+l,-",Pfc) ° ^fpi,...,p m _i (Pm) ° £piv,Pm-H 

so that, changing p m for ^ Pl ,...,p m _ 1 (p ra ), we are back to proving the result when 
m = k = 1, where it is obvious from the definitions. 

Now, eq. (A.l) on the previous page and the minimality of 7Tfe implies that 

7Tfe(i) = 7r fe (j) => t Pl ,..., Pk (i) = t P i,..., Pk (j), i,3 > 0. 

Let us assume conversely that £, Pl ,..., Pk (i) — £, P i,..., Pk {j) an d ^ 

m = min{£ : £ Pl ,..., pe (i) = € Pl ,..., Pi (j)}- 

Since £ Pl ,..., Pm = ^ P1 ,..., Pm _ 1 ( Pm ) € P i,..., Pm -i . we see that necessarily 

i,...,p m _i ({pLp™}). 

and that this set contains two distinct elements. Exchanging the role of i and j 
if necessary, we can assume without loss of generality that 

£pi,...,p m _i = £pi,...,p m _i (pm) • 

From the induction hypothesis, this implies that 7r m _i((i,j)) = 7r m _i(p m ). 
Since the equivalence relation defined by 7r TO _i is a subset of the equivalence 
relation defined by ir^, this implies that TTk((i,j)) = ^k(Pm)- Since moreover 
Tk(Pm) = ^fe). tms implies that 7r fe (i) = 7r fe (j). □ 
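
The content of Lemma A.2 can be tested on small inputs. The Python sketch below is an illustration under assumptions (the function names `xi` and `component_min` and the dictionary representation of label maps are hypothetical). It builds the label map by the recursive definition, builds separately the map sending each label to the smallest label of its component in the equivalence relation generated by the pairs, and compares the two for every ordering of a short list of pairs.

```python
from itertools import permutations

def xi(pairs):
    """Label map xi_{p_1,...,p_k}, built step by step from the definition.

    Returned as a dict on the labels occurring in the pairs; every other
    label is left unchanged by the map.
    """
    f = {x: x for p in pairs for x in p}
    for p in pairs:
        q = (f[p[0]], f[p[1]])  # image of the next pair under the current map
        lo = min(q)
        # Compose with xi_q: both entries of q are sent to their minimum.
        f = {x: (lo if y in q else y) for x, y in f.items()}
    return f

def component_min(pairs):
    """Send each label to the least label of its connected component."""
    parent = {x: x for p in pairs for x in p}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        parent[max(ra, rb)] = min(ra, rb)  # the root is always the minimum
    return {x: find(x) for x in parent}

pairs = [(3, 5), (5, 2), (7, 8), (9, 4)]
reference = component_min(pairs)
all_orders_agree = all(xi(list(order)) == reference
                       for order in permutations(pairs))
```

With these pairs the components are {2, 3, 5}, {7, 8} and {4, 9}, so every ordering should map 3 and 5 to 2, 8 to 7, and 9 to 4.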
Lemma A. 3 

For any / G any sequence of pairs of positive labels Pi,...,Pk> there is a label 
map g G $ such that 

£/(pi,...,p fc ) / = 9°€pi,...,Pk- 
Proof. We have to prove that 

£pi,..., Pfc (i) = t P i,-, Pk (j) => 0(Pi,-,P fc ) °/W = 0(Pi,-,P fc ) ° /0')' «,i>0- 

From the proof of the previous lemma, it is enough to check that the right-hand 
side holds when — p m , m = 1, . . . , fc, which is then obvious. □ 

Lemma A. 4 

If f £$ and (pi, ...,pk) is ^-congruent, then (f(pi), f(pk)) is also f(<3)- 
congruent. 

Proof Assume that for some a G S* 

Zpu...,p m -i@)(ak n ..... Pm _ 1 (pM)>0- 
Then, £/( P i,..., Pm _i) o / = jo( P i,..., Pm -i, and 
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f/(pi,..., Pm -i) /(SO (.9( a )k /(P1 Pm _ l) °/(PL)) 

= .9 o ^p 1 ,..., Pm _ 1 (Sf) (g(a) ]goi P1 ,..., Pm _ 1 (pi,)) 

= £ Pl ,.., Pm O0 (ff~ 1 °ff(a]Cp 1 ,...,p m _ 1 (i.3 il ))) 

>^ 1 ,.., Pm - 1 (^)(«]« P1 p ra _ 1 (pi,)) >°- 

The same is true when pL is replaced with pL and when a If . is 

S P1 Pm-HPm' 

replaced with off , . 

The lemma is a straightforward consequence of these remarks and the defi- 
nition of a congruent sequence. □ 

Lemma A. 5 

If (pi, . . . ,pk) and (qi, . . . , are both <$ -congruent, then 

(pi, . . . ,p fc ,gi, ...,%) 

is & -congruent. 

Proof. According to the previous lemma, ... jPt . (qi, qi) is £ Pl ,... tj 
congruent. Coming back to the definition this proves that 

^Pi,..., Pfe (9i,-,9f-l) ° Cpi,-,Pk(%) 



is 



$£ P1 .... , Pfe ( 91 ° &i ,. • • , Pk {&) -congruent. 

In lemma A.l on page 35 we have moreover proved that 

£,pi,...,p k ,qi,...,qe-i — ^ P1 ,...,p k (qi,...,qt-i) ° £pi, ...,Pfc- 

This identity applied to the above statement shows that (pi, ■ ■ ■ ,Pk, <?i, ■ ■ ■ , <fe) 
satisfies the definition of a -congruent sequence. □ 

Proposition A. 6 

If (pi, . . . ,pk) and (qi, . . . , qt) are both maximal 'S ' -congruent, then 

Proof. From the previous lemma, (pi, . . . ,pk, qi, ■ ■ . , qt) is ^-congruent. Since 
p is maximal, t Pl ,..., Pk , qi ,..., qe = £ Pl ,..., Pk - In the same way £ qi ,..., qt ,p u ..., Pk = 
£ qi ,..., qe - We have seen moreover in a previous lemma that 



£,p 1 ,...,p k ,q 1 ,...,qt — 



,qt,pi,...,Pk ■ 



This proves that € Pl ,..., Pk = £, qu ..., qi - 

We see from the definition of \ ( see Definition 6.4 on page 16) that there is 
some maximal {^-congruent sequence n, . . . , r m such that x(^) = £n,...,r m {&) ■ 
Therefore 
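Proposition A.7

For any $\mathcal{G}, \mathcal{G}' \in \mathfrak{G}$,

$$\chi(\chi(\mathcal{G}) \boxplus \mathcal{G}') = \chi(\mathcal{G} \boxplus \mathcal{G}').$$

Consequently, for any $\mathcal{G}, \mathcal{G}', \mathcal{G}'' \in \mathfrak{G}$,

$$\chi(\chi(\mathcal{G} \boxplus \mathcal{G}') \boxplus \mathcal{G}'') = \chi(\mathcal{G} \boxplus \mathcal{G}' \boxplus \mathcal{G}'').$$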
Proof. Let us assume that Sf, <S' and x(^) use disjoint label sets, so that 
X (x(Sf)fflSf') = x ( x (Sf)+Sf'), 

x(^fflsf') =x(^ + ^')- 

Let pi,. . . ,pk be some maximal £f -congruent sequence. It is also (Sf + ^')- 
congruent, and since label sets are disjoint, 

Let us continue the sequence pi, . . . ,pk to form a maximal -congruent 
sequence pi, . . . ,pg. Let (qt+i, ■ ■ ■ , qe) be defined as 

Qm £,p 1 ,...,p k {Pk+m) ■ 

We see from the definitions that (q/c+i, • ■ • , qi) is a maximal ^p 1: ....p k \S -t- 
congruent sequence, and therefore a maximal (^ Pl ,..., Pk (&) + Sf'^ -congruent se- 
quence. Consequently 

= €q k +i, — ,qt ° Cpi,...,p fe + ^ ) = ^^...^(.(pfe+i,...^) ° Cpi,...,p fc + ^ ) 

= £ P1 ,..., w (^ + ^0 =x(V + ^'), 

proving the proposition. □ 

Proof of lemma 8.1 on page 19. Let 7r be the projection of © on ®/=. 
From the law of large numbers, we have that, for all Sf e ©, 

1 ™ 

-^Tl^EESf) — > G(7T(Sf)). 



n . 

?,=i 



Let us now remark that ffl n _1 G; = ffl ffl n Gi. Thus 

: i 



-x I ffl d = x _EB x ffl » _1 g. 



n 1 »=i 
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We used here Proposition A. 7 on the preceding page and the fact that for any 
a, b e R + , 

X [(aSP)ffl(6Sf)] = (a + 6)x(Sf), 
which comes from the following reasoning: Suppose that 

{l,...,d} = {»; Sf ><)}, 

and let pi = (2i, 2i — 1). Since each pi is (a&) ffl (65f)-congrucnt, (pi, . . . is 
also (aSf) ffl (feSf)-congruent, from lemma A. 5 on page 37. It is quite straight- 
forward to see that 

[(aSf)EB (&<*)] =(a + b)&. 

This implies that 

X [(aSf)ffl(6Sf)] =xo^,..., Pi [(a^)ffl(W)] = x [(a + 6)Sf] = (a + 6) x (Sf). 

To take the limit inside $\chi$, we need to prove that $\chi$ is continuous in a suitable
sense. Actually, $\mathcal{G} \mapsto \chi(\mathcal{G})$ is continuous on sets of fixed support, and this is
what is required to conclude.

Indeed, for any sequence $(\mathcal{G}_n)$ with fixed support for $n$ large enough, there is
a fixed label map $f$ (depending on the support) such that for $n$ large enough
$\chi(\mathcal{G}_n) = f(\mathcal{G}_n)$, and the result follows from the fact that $\mathcal{G} \mapsto f(\mathcal{G})$ is continuous,
since $f(\mathcal{G})(A) = \mathcal{G}(f^{-1}(A))$.

Consequently 

lim -x\ ffl G l ) =x\ ffl lim (- Vl(Gi ) 

n->oo n v »=i y we®/= ™^°° \ n ' I 

= x (ffl ) = x (_ffl G(^) X (^) | 

v »6®/= y v »g®/= y 

= x ( _ffl ffl <G(S?)x(Sf) ) = x f ffl G(Sf)x(Sf) 
\^fe®/= »e» y ' 



EE G(Sf)Sf^ = ^Sf dG(S?)- □ 
Appendix B: Language produced by a Toric Grammar

In this appendix, we make a deterministic study of the language produced by
a toric grammar $\mathcal{G} \in \beta^*(\mathcal{T})$. More precisely, we are interested in the support of
the distribution $\mathbb{T}_{\mathcal{G}}$ of the final state of the production process.

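Lemma B.1

Let $\mathcal{T} \in \mathfrak{T}$ be some text and $\mathcal{G} \in \beta^*(\mathcal{T})$ be some grammar obtained by splitting
this text a finite number of times. The number of splits performed can be read
in $\mathcal{G}$ and is equal to

$$n = \sum_{i=1}^{+\infty} \mathcal{G}(]_i\, S^*).$$

Let us put $\overline{\alpha}(\mathcal{G}) = \alpha^n(\mathcal{G})$. Then, $\mathcal{T} \in \overline{\alpha}(\mathcal{G}) \subset \mathfrak{T}$, moreover $\overline{\alpha}(\mathcal{G}) = \operatorname{supp}(\mathbb{T}_{\mathcal{G}})$.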

Proof. The grammar is obtained by making a succession of splits. Each of
those splits adds one $[_j$ and one $]_j$ to the grammar, whereas in the original text
there are no $[_j$ nor $]_j$, except for the $[_0$ at the beginning of each sentence. Since
application of a label map does not change the number of such symbols,
they may be used to count the number of splits performed.
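
As a small illustration of this counting argument, the Python sketch below reads the number of splits off a grammar by summing the weights of its closing labels. It is a hypothetical sketch (the tuple encoding and the name `count_splits` are assumptions introduced here), applied to the text $[_0\, abcd$ and to a grammar obtained from it by three splits.

```python
from collections import Counter

def count_splits(g):
    """Total weight of closing labels ]_i, i >= 1, in the grammar g.

    Hypothetical encoding: words are str, [_i is ("[", i), ]_i is ("]", i);
    g maps canonical expressions (tuples) to integer weights.
    """
    return sum(
        n * sum(1 for s in e
                if isinstance(s, tuple) and s[0] == "]" and s[1] >= 1)
        for e, n in g.items()
    )

text = Counter({(("[", 0), "a", "b", "c", "d"): 1})
grammar = Counter({
    (("[", 0), "a", ("]", 1)): 1,
    (("[", 1), "b", ("]", 2)): 1,
    (("[", 2), "c", ("]", 3)): 1,
    (("[", 3), "d"): 1,
})
n_text = count_splits(text)
n_grammar = count_splits(grammar)
```

Relabelling the grammar with any label map that keeps positive labels positive leaves this count unchanged, which is the point used in the proof.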

Let us take then a sequence of toric grammars 2T = %,..., & n = @, such that 
'Sh £ From lemma 4.2 on page 10, there is a sequence fi, ■ ■ ■ , f n £ 5 

such that fk (S^fc-i) £ ct(&k) ■ Let us prove by induction that for any k = 0, . . . , n, 

fkO---of 1 (£r)£a k (%). 

Indeed, this is true for k = 0, since % = & ■ Moreover, assuming that the 
assertion holds for k — 1, we deduce that 

fkO---of 1 (#)ef k (a k - 1 (? k - 1 )) Ca^^/^Sffc-i)) Ca k (^ k ). 

showing that if the assertion holds for k, it also holds for k + 1. For k = n, we 
obtain that 

/ n o-o/i(^)ea n (Sf n ). 

As/ n o.--o/ 1 (^)=^, since ST is a text, and <S n = <S ', we get that Jea" (Sf ) . 

Let us consider now <S' £ a(S?). Let = %, . . . = <S') the chain of 
grammars leading to C S' . Then for any k = 0, . . . , n, 

+ OC 

£> fc (]iS*) =n-k, 
»=i 

since % £ a(@h-i) and each merge takes away one [, and one ]j. This implies 

that J2t=^ ( \i s *) = °> and thus Sf ' € X. 

Note that, as remarked above, repeated merges may create elements of the
type $[_i\, a\, ]_i\, b$. However, this will not happen if $n$ successful merges can be
performed. Indeed in the case when expressions of the form $[_i\, a\, ]_i\, b$ remain
unmatched during the merge process, we will get $\alpha(\mathcal{G}_k) = \emptyset$ for some $k < n$. □